Curated Snapshot · May 18, 2026

AI model benchmarks: ChatGPT vs Claude vs Gemini vs Grok.

Compare 62 frontier and open-weight AI models across benchmarks, pricing, and context windows, then evaluate open-source agent stacks like OpenClaw, Hermes Agent, OpenCode, Cline, and DeerFlow on practical workflow fit. Watchlist rows are separated from verified scores so new model listings do not imply fake rankings.

62 models (47 scored · 15 watchlist) · 8 agent scores · 42 proprietary · 20 open
Best Overall: GPT-5.5 High (OpenAI) · 98.1 overall
Best Coding: Claude Opus 4.7 (Anthropic) · 96.4% HumanEval
Best Open Source: Kimi K2.6 (Moonshot AI) · 88.5 overall
Best Value: Muse Spark (Meta) · $0/1M tokens

Agent workflow benchmark · May 6, 2026

Open-source agent workflow leaderboard

This NeuralStackly score combines public GitHub signals, docs, setup ergonomics, autonomy, memory, security posture, ecosystem depth, and developer workflow fit. It is a practical adoption score, not a vendor-reported model benchmark.

Open-source repos

8 agents scored · 954k repo stars tracked · 91.0 top score

#1 OpenClaw · personal agent

369k GitHub stars

91.0

Always-on personal agents, multi-channel automation, skills, and OpenClaw ecosystem searches.

#2 Hermes Agent · personal agent

135k GitHub stars

90.5

Self-improving long-running agents with persistent memory, cron, skills, and provider routing.

#3 OpenCode · coding agent

156k GitHub stars

88.0

Terminal-native coding agent workflows with provider control and open-source ergonomics.

#4 Cline · coding agent

61k GitHub stars

84.5

IDE-based coding agents that ask permission before editing files or executing commands.

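| Rank | Agent | Stars | Setup | Autonomy | Memory | Security | Ecosystem | Score |
|------|-------|-------|-------|----------|--------|----------|-----------|-------|
| #1 | OpenClaw | 369k | 72 | 94 | 82 | 84 | 99 | 91.0 |
| #2 | Hermes Agent | 135k | 78 | 92 | 95 | 86 | 88 | 90.5 |
| #3 | OpenCode | 156k | 86 | 88 | 76 | 84 | 90 | 88.0 |
| #4 | Cline | 61k | 84 | 84 | 72 | 86 | 82 | 84.5 |
| #5 | - | 65k | 68 | 90 | 84 | 90 | 82 | 83.5 |
| #6 | Goose | 44k | 82 | 84 | 76 | 84 | 78 | 82.0 |
| #7 | Browser Use | 92k | 86 | 78 | 58 | 78 | 88 | 79.0 |
| #8 | LangGraph | 31k | 76 | 82 | 78 | 82 | 84 | 78.5 |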

Curated Leaderboard

Ranks are unique for scored rows. Newly listed models show as watchlist until public benchmark data is available.

| Rank | Change | Model | Params | Company | MMLU | HumanEval | MATH | GPQA | Arena ELO (live) | Overall | Cost/1M | Context |
|------|--------|-------|--------|---------|------|-----------|------|------|------------------|---------|---------|---------|
| 1 | NEW | GPT-5.5 High | - | OpenAI | 89.1 | 89.7 | 93.2 | 81.5 | 1484 | 98.1 | $2.49 | 400K |
| 2 | NEW | Claude Opus 4.7 | - | Anthropic | 95.6 | 96.4 | 98.2 | 91.0 | 1492 | 94.2 | $15.00 | 1M |
| 3 | 2 | Gemini 3.1 Pro | - | Google | 94.3 | 93.2 | 98.1 | 94.1 | 1493 | 94.2 | $2.50 | 1M |
| 4 | 3 | GPT-5.4 | - | OpenAI | 93.8 | 96.3 | 97.9 | 92.8 | 1467 | 94.2 | $2.50 | 1M |
| 5 | 1 | Claude Opus 4.6 | - | Anthropic | 94.2 | 95.1 | 97.6 | 90.5 | 1498 | 90.3 | $15.00 | 1M |
| 6 | NEW | GPT-5.3 Codex | - | OpenAI | 91.2 | 95.8 | 96.2 | 84.8 | 1300 | 88.5 | $3.00 | 400K |
| 7 | NEW | Kimi K2.6 | 1T (MoE 32B active) | Moonshot AI | 91.8 | 90.1 | 96.5 | 85.4 | 1529 | 88.5 | $0.95 | 256K |
| 8 | NEW | Grok 4.20 | - | xAI | 91.2 | 91.5 | 95.1 | 79.4 | 1481 | 88.0 | $3.00 | 256K |
| 9 | 2 | GPT-5 | - | OpenAI | 91.5 | 92.1 | 97.9 | 88.1 | 1475 | 87.1 | $1.25 | 400K |
| 10 | 2 | Claude Opus 4.5 | - | Anthropic | 92.8 | 94.2 | 96.1 | 87.0 | 1469 | 86.2 | $5.00 | 1M |
| 11 | - | Claude Sonnet 4.6 | - | Anthropic | 91.8 | 93.8 | 97.8 | 89.9 | 1468 | 84.6 | $6.00 | 1M |
| 12 | NEW | DeepSeek V4 Pro | - | DeepSeek | - | - | - | - | 1463 | 84.6 | $0.43 | 1M |
| 13 | NEW | Muse Spark | - | Meta | 91.8 | 92.1 | 95.4 | 80.7 | 1490 | 84.6 | $0.00 | 262K |
| 14 | NEW | Qwen3.6 Max | - | Alibaba | 90.5 | 90.8 | 94.6 | 77.2 | 1362 | 84.6 | $1.60 | 128K |
| 15 | 2 | Claude Sonnet 4.5 | - | Anthropic | 89.1 | 90.8 | 97.7 | 83.4 | 1453 | 83.8 | $3.00 | 1M |
| 16 | 5 | GLM-5.1 | ? | Zhipu AI | 89.7 | 88.1 | 94.5 | 86.2 | 1534 | 82.7 | $2.15 | 200K |
| 17 | 1 | GPT-5.2 | - | OpenAI | 92.1 | 94.8 | 98.0 | 91.4 | 1436 | 81.5 | $1.75 | 400K |
| 18 | 1 | GLM-5 | ? | Zhipu AI | 91.2 | 90.5 | 95.3 | 77.8 | 1470 | 80.8 | $1.00 | 200K |
| 19 | NEW | MiniMax M2.7 | ? | MiniMax | 89.2 | 86.8 | 93.5 | 82.1 | 1250 | 80.8 | $0.32 | 205K |
| 20 | 3 | DeepSeek V3 | 671B (MoE 37B active) | DeepSeek | 85.1 | 84.2 | 88.5 | 72.4 | 1424 | 79.4 | $0.27 | 128K |
| 21 | 2 | Grok 4 | - | xAI | 88.0 | 87.5 | 93.2 | 81.8 | 1460 | 78.8 | $3.00 | 2M |
| 22 | 3 | GPT-4.1 | - | OpenAI | 88.2 | 91.5 | 93.8 | 80.5 | 1413 | 77.7 | $2.00 | 1M |
| 23 | NEW | DeepSeek V4 Flash | - | DeepSeek | - | - | - | - | 1433 | 75.0 | $0.14 | 1M |
| 24 | - | Gemini 3 Flash | - | Google | 90.8 | 89.6 | 95.2 | 90.4 | 1473 | 73.1 | $0.50 | 1M |
| 25 | NEW | Gemini 3 Flash Preview | - | Google | - | - | - | - | - | 73.1 | $0.50 | 1M |
| 26 | 3 | Claude 3.5 Sonnet | - | Anthropic | 82.1 | 85.8 | 84.1 | 65.3 | 1372 | 71.5 | $3.00 | 200K |
| 27 | 2 | GPT-4o | - | OpenAI | 83.5 | 86.5 | 85.2 | 68.4 | 1345 | 67.4 | $2.50 | 128K |
| 28 | NEW | DeepSeek V3.2 | 685B (37B active, MoE) | DeepSeek | 89.5 | 89.3 | 93.2 | 75.8 | 1423 | 65.4 | $0.28 | 128K |
| 29 | 1 | Gemini 3 Pro | - | Google | 93.5 | 92.0 | 98.5 | 91.9 | 1486 | 63.5 | $2.00 | 1M |
| 30 | 3 | Gemma 4 31B | 31B | Google | 80.5 | 79.8 | 83.8 | 68.5 | 950 | 59.6 | $0.10 | 128K |
| 31 | NEW | OpenAI o3-mini | - | OpenAI | - | - | - | - | - | 57.7 | $1.10 | 200K |
| 32 | 4 | Kimi K2.5 | ? | Moonshot AI | 90.4 | 89.3 | 96.8 | 87.6 | 1432 | 55.8 | $0.60 | 262K |
| 33 | NEW | Nemotron 3 Super | 120B (12B active) | NVIDIA | 85.3 | 84.2 | 88.5 | 68.9 | 1361 | 53.8 | $0.20 | 128K |
| 34 | 4 | Gemini 2.5 Pro | - | Google | 87.8 | 88.4 | 92.3 | 79.8 | 1447 | 51.9 | $1.25 | 1M |
| 35 | NEW | gpt-oss-120B | 120B | OpenAI | 83.1 | 81.6 | 86.2 | 65.4 | 1220 | 48.1 | $0.15 | 128K |
| 36 | 5 | Grok 3 | - | xAI | 84.5 | 83.5 | 87.8 | 72.1 | 1362 | 46.2 | $3.00 | 131K |
| 37 | - | Qwen 3.5 122B | 122B (MoE A10B) | Alibaba | 84.2 | 84.8 | 90.1 | 86.6 | 1134 | 35.5 | $0.40 | 262K |
| 38 | NEW | GPT-5.4 mini | - | OpenAI | 87.2 | 88.5 | 91.5 | 79.8 | 1456 | 28.8 | $1.69 | 400K |
| 39 | 4 | Mistral Large | - | Mistral AI | 81.2 | 80.1 | 82.5 | 62.1 | 1415 | 28.8 | $2.00 | 128K |
| 40 | 2 | Qwen 3.5 397B | 397B (MoE A17B) | Alibaba | 89.5 | 88.9 | 95.6 | 88.4 | 1067 | 25.3 | $0.60 | 262K |
| 41 | 3 | Command R+ | 104B | Cohere | 75.2 | 75.0 | 72.8 | 55.2 | 1060 | 24.2 | $2.50 | 128K |
| 42 | 4 | Mixtral 8x22B | 141B (MoE 39B active) | Mistral AI | 72.8 | 71.5 | 68.2 | 48.5 | 1030 | 19.7 | $0.65 | 65K |
| 43 | 1 | Llama 4 Maverick | 400B (MoE 17B active) | Meta | 87.2 | 86.1 | 91.5 | 78.2 | 1327 | 19.2 | $0.20 | 10M |
| 44 | 2 | MiniMax M2.5 | ? | MiniMax | 83.8 | 83.5 | 89.2 | 74.8 | 1020 | 18.2 | $0.20 | 1M |
| 45 | 6 | Llama 3.1 405B | 405B | Meta | 79.8 | 79.2 | 78.5 | 60.8 | 1333 | 17.3 | $3.00 | 128K |
| 46 | - | Llama 4 Scout | 109B (MoE 17B active) | Meta | 82.5 | 82.3 | 86.1 | 70.5 | 1322 | 11.5 | $0.11 | 10M |
| 47 | 2 | Phi-4 | 14B | Microsoft | 78.5 | 82.5 | 80.8 | 61.5 | 1080 | 3.8 | $0.07 | 16K |

Watchlist (all rows marked NEW, benchmark pending)

| Rank | Model | Company | Scores | Cost/1M | Context |
|------|-------|---------|--------|---------|---------|
| Watch | Devstral 2512 | Mistral | pending | $0.40 | 256K |
| Watch | GPT-5.5 | OpenAI | pending | $5.00 | 1.05M |
| Watch | GPT-5.5 Pro | OpenAI | pending | $30.00 | 1.05M |
| Watch | Grok 4.3 | xAI | pending | $1.25 | 1M |
| Watch | MiniMax 01 | MiniMax | pending | $0.20 | 1M |
| Watch | Mistral Large 2512 | Mistral | pending | $0.50 | 256K |
| Watch | Mistral Medium 3.5 | Mistral AI | pending | $1.50 | 262K |
| Watch | OpenAI o3 | OpenAI | pending | $2.00 | 200K |
| Watch | OpenAI o3-pro | OpenAI | pending | $20.00 | 200K |
| Watch | OpenAI o4-mini | OpenAI | pending | $1.10 | 200K |
| Watch | Qwen3.6 Flash | Alibaba | pending | $0.25 | 1M |
| Watch | Qwen3.6 Max Preview | Alibaba | pending | $1.04 | 262K |
| Watch | Qwen3.6 Plus | Alibaba | pending | $0.33 | 1M |
| Watch | Qwen3 Coder Plus | Qwen | pending | $0.65 | 1M |
| Watch | Qwen3 Max | Qwen | pending | $0.78 | 256K |

62 models · Sorted by Rank (ascending)
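
The split between scored rows and watchlist rows can be sketched in a few lines. This is a minimal illustration, not the site's actual code: the `Row` type and `assign_ranks` function are hypothetical names, and a pending benchmark is assumed to be represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    model: str
    overall: Optional[float]  # None means "benchmark pending" (assumed representation)


def assign_ranks(rows):
    """Return (label, model) pairs: numeric ranks for scored rows, 'Watch' otherwise."""
    scored = [r for r in rows if r.overall is not None]
    watch = [r for r in rows if r.overall is None]
    # Highest overall score first. sorted() is stable, so tied rows keep their
    # input order and every scored row still receives a unique rank.
    scored = sorted(scored, key=lambda r: r.overall, reverse=True)
    ranked = [(str(i), r.model) for i, r in enumerate(scored, start=1)]
    # Watchlist rows follow the scored rows, alphabetically, with no numeric rank.
    ranked += [("Watch", r.model) for r in sorted(watch, key=lambda r: r.model)]
    return ranked


rows = [
    Row("GPT-5.5", None),
    Row("Claude Opus 4.7", 94.2),
    Row("GPT-5.5 High", 98.1),
    Row("Grok 4.3", None),
]
result = assign_ranks(rows)
for label, model in result:
    print(label, model)
```

Keeping unscored models out of the sort entirely, rather than giving them a placeholder score, is what prevents a new listing from landing at an arbitrary position in the ranking.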

Compare Models

Select up to 3 models to see side-by-side benchmark comparisons.

GPT-5.5 High
Claude Opus 4.7
Gemini 3.1 Pro
MMLU: Massive Multitask Language Understanding
  GPT-5.5 High 89.1 · Claude Opus 4.7 95.6 · Gemini 3.1 Pro 94.3

MMLU-Pro: Harder version of MMLU with 10 choices per question (vs 4) and more complex reasoning required
  GPT-5.5 High 82.3 · Claude Opus 4.7 92.3 · Gemini 3.1 Pro 91.0

HumanEval: Coding benchmark with 164 Python programming challenges
  GPT-5.5 High 89.7 · Claude Opus 4.7 96.4 · Gemini 3.1 Pro 93.2

MATH: Competition-style mathematics problems across algebra, geometry, number theory, and more
  GPT-5.5 High 93.2 · Claude Opus 4.7 98.2 · Gemini 3.1 Pro 98.1

GPQA Diamond: Graduate-level science questions in physics, chemistry, and biology
  GPT-5.5 High 81.5 · Claude Opus 4.7 91.0 · Gemini 3.1 Pro 94.1

SimpleQA: Factual accuracy benchmark testing whether models give correct, concise answers to straightforward questions
  GPT-5.5 High 36.2 · Claude Opus 4.7 45.1 · Gemini 3.1 Pro 44.2

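| Metric | GPT-5.5 High | Claude Opus 4.7 | Gemini 3.1 Pro |
|--------|--------------|-----------------|----------------|
| Overall Score | 98.1 | 94.2 | 94.2 |
| Cost/1M In | $2.49 | $15.00 | $2.50 |
| Cost/1M Out | $9.98 | $75.00 | $15.00 |
| Context Window | 400K | 1M | 1M |
| Category | proprietary | proprietary | proprietary |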
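
To turn the per-1M-token prices above into a per-request figure, a small calculation like the following can help. The prices are copied from the comparison table; the `request_cost` helper and the workload size (20,000 input tokens, 2,000 output tokens) are assumptions made only for illustration.

```python
# Prices per 1M tokens as (input, output), copied from the comparison table.
PRICES = {
    "GPT-5.5 High": (2.49, 9.98),
    "Claude Opus 4.7": (15.00, 75.00),
    "Gemini 3.1 Pro": (2.50, 15.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Estimate the dollar cost of one request from per-1M-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 20,000 input tokens and 2,000 output tokens per request.
costs = {model: request_cost(model, 20_000, 2_000) for model in PRICES}
for model, cost in costs.items():
    print(f"{model}: ${cost:.4f} per request")
```

For this workload the estimate comes to roughly $0.07 for GPT-5.5 High, $0.08 for Gemini 3.1 Pro, and $0.45 for Claude Opus 4.7, so the output price matters as much as the headline input price once responses get long.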

Methodology

Last updated: May 18, 2026

Our benchmark scores are compiled from independently-run evaluations by Epoch AI, Scale AI, LMSYS Chatbot Arena, and official model technical reports. We prioritize independently verified scores over self-reported numbers where available. All scores represent the best reported result for each model on each benchmark.

MMLU

Massive Multitask Language Understanding. Tests knowledge across 57 subjects including STEM, humanities, and social sciences. 14,000+ multiple-choice questions.

MMLU-Pro

Harder version of MMLU with 10 choices per question (vs 4) and more complex reasoning required. Better at differentiating top models.

HumanEval

Coding benchmark with 164 Python programming challenges. Tests function implementation from docstrings. Reports pass@1 rate.
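
For readers unfamiliar with pass@1, here is a minimal sketch of the unbiased pass@k estimator introduced alongside HumanEval, which reduces to the fraction of correct samples when k = 1. How each score on this page was actually computed is not stated, so the sample counts below are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for three problems, 10 samples each.
results = [(10, 10), (10, 5), (10, 0)]
score = benchmark_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```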

MATH

Competition-style mathematics problems across algebra, geometry, number theory, and more. Level 5 (hardest) used for frontier model comparison.

GPQA Diamond

Graduate-level science questions in physics, chemistry, and biology. PhD experts validate questions; random guessing yields ~25%.

Arena ELO

Chatbot Arena rating from LMSYS. Real humans vote on model outputs in blind head-to-head comparisons. Measures conversational quality.
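
As a rough intuition for how head-to-head votes become a rating, the classic online Elo update is sketched below. This is an illustration only: the live leaderboard may be fitted with a different statistical model, and the K-factor of 32 is an assumed value. The two starting ratings are taken from the leaderboard above.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return new (A, B) ratings after one vote.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Claude Opus 4.7 (1492) vs GPT-5.5 High (1484), and a vote for the first model.
p = expected_score(1492, 1484)
new_a, new_b = update_elo(1492, 1484, score_a=1.0)
print(f"expected win probability: {p:.3f}")
print(f"new ratings: {new_a:.1f}, {new_b:.1f}")
```

An 8-point gap corresponds to a win probability only slightly above 50%, which is why small Elo differences near the top of the table should not be over-interpreted.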

SimpleQA

Factual accuracy benchmark testing whether models give correct, concise answers to straightforward questions. Measures hallucination rate.

How We Calculate Overall Score

The overall score is a weighted composite of normalized benchmark results:

  • MMLU — 15% weight (breadth of knowledge)
  • MMLU-Pro — 15% weight (harder reasoning)
  • HumanEval — 20% weight (coding ability)
  • MATH — 15% weight (mathematical reasoning)
  • GPQA Diamond — 15% weight (scientific reasoning)
  • Arena ELO — 10% weight (real-world preference)
  • SimpleQA — 10% weight (factual accuracy)
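
The weighting above can be sketched as a weighted average. The weights come from the list; everything else is an assumption, because the page does not specify its normalization. Here percentage benchmarks are used as-is, Arena ELO is min-max scaled over an assumed 1000 to 1600 range, and missing benchmarks are handled by renormalizing the remaining weights. The result therefore will not reproduce the published overall scores.

```python
# Weights from the methodology list (they sum to 1.0).
WEIGHTS = {
    "mmlu": 0.15,
    "mmlu_pro": 0.15,
    "humaneval": 0.20,
    "math": 0.15,
    "gpqa_diamond": 0.15,
    "arena_elo": 0.10,
    "simpleqa": 0.10,
}

# Assumed Elo range for min-max scaling; not taken from the page.
ELO_MIN, ELO_MAX = 1000.0, 1600.0


def normalize(name, value):
    """Map a raw benchmark value onto a 0-100 scale."""
    if name == "arena_elo":
        scaled = (value - ELO_MIN) / (ELO_MAX - ELO_MIN) * 100.0
        return max(0.0, min(100.0, scaled))
    return value  # percentage benchmarks are assumed to be on 0-100 already


def overall_score(scores):
    """Weighted composite over available benchmarks; None if nothing is available."""
    available = {k: v for k, v in scores.items() if k in WEIGHTS and v is not None}
    total_weight = sum(WEIGHTS[k] for k in available)
    if total_weight == 0:
        return None
    weighted = sum(WEIGHTS[k] * normalize(k, v) for k, v in available.items())
    return weighted / total_weight


# Claude Opus 4.7 values as listed on this page.
opus = {
    "mmlu": 95.6,
    "mmlu_pro": 92.3,
    "humaneval": 96.4,
    "math": 98.2,
    "gpqa_diamond": 91.0,
    "arena_elo": 1492,
    "simpleqa": 45.1,
}
composite = overall_score(opus)
print(f"composite under these assumptions: {composite:.2f}")
```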

Data Sources

Arena ELO: live · May 7, 2026
  • Artificial Analysis / public benchmark pages where available
  • LMSYS / Chatbot Arena where available
  • OpenRouter model availability and pricing
  • Official provider model cards
  • LMSYS Arena (live)

About These Benchmarks

Benchmarks measure specific capabilities under controlled conditions and may not fully represent real-world performance. Different prompting strategies, system prompts, and tool use can significantly change results. We recommend trying models on your specific tasks using our tool finder or the AI tools directory.

AI Model Benchmarks: What You Need to Know in 2026

The AI landscape in 2026 is dominated by a handful of frontier models that push the boundaries of reasoning, coding, and scientific knowledge. Our LLM leaderboard tracks the latest benchmark scores from independent evaluations, giving you an objective way to compare models like GPT-5.5 High, Claude Opus 4.7, and Gemini 3.1 Pro.

Whether you're choosing a model for coding assistance, research, or general chatbot use, benchmarks like MMLU (knowledge), HumanEval (coding), GPQA Diamond (science), and Chatbot Arena ELO (human preference) provide concrete data points for comparison.

Key Takeaways for May 2026

  • GPT-5.5 High remains the top-scored row in this snapshot; Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.4 follow closely.
  • New model listings like GPT-5.5, Grok 4.3, Mistral Medium 3.5, and the Qwen3.6 Flash, Plus, and Max Preview variants are included as watchlist rows until public benchmark data is available.
  • Coding comparisons should use HumanEval and model-specific coding evals alongside price and context window — raw overall rank is not enough.
  • Agent workflow scores now belong beside model benchmarks because production fit depends on setup time, memory, security, and ecosystem quality.
  • Open-weight models remain strongest when privacy, local deployment, and cost control matter more than absolute frontier performance.
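
One way to apply the coding takeaway is to filter on HumanEval, price, and context window together rather than sorting by overall rank. The sketch below uses a handful of rows from the leaderboard; the `shortlist` function, the thresholds, and the conversion of "1M" and "400K" to round token counts are assumptions for illustration.

```python
MODELS = [
    # (name, HumanEval %, cost per 1M tokens in USD, context window in tokens)
    ("GPT-5.5 High", 89.7, 2.49, 400_000),
    ("Claude Opus 4.7", 96.4, 15.00, 1_000_000),
    ("Gemini 3.1 Pro", 93.2, 2.50, 1_000_000),
    ("GPT-5.4", 96.3, 2.50, 1_000_000),
    ("Kimi K2.6", 90.1, 0.95, 256_000),
]


def shortlist(models, min_humaneval, max_cost, min_context):
    """Keep models meeting all three constraints, best HumanEval first."""
    keep = [
        m for m in models
        if m[1] >= min_humaneval and m[2] <= max_cost and m[3] >= min_context
    ]
    # Sort by HumanEval descending, then by cost ascending to break ties.
    return sorted(keep, key=lambda m: (-m[1], m[2]))


# Hypothetical requirements: strong coding, at most $5 per 1M tokens, 500K+ context.
picks = shortlist(MODELS, min_humaneval=93.0, max_cost=5.00, min_context=500_000)
names = [m[0] for m in picks]
print(names)
```

With these thresholds the overall leader drops out on both HumanEval and context window, which illustrates why raw overall rank is not enough for a coding shortlist.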

Explore our full AI tools directory for detailed reviews of each model, or use the AI Tool Finder to discover which model best fits your needs.

AI benchmark FAQ

Which AI model is best overall in this benchmark snapshot?

The top scored row changes as public benchmark data updates. NeuralStackly separates verified benchmark rows from watchlist models so newly listed models do not receive fake rankings before public data is available.

Should I choose ChatGPT, Claude, Gemini, or Grok from benchmark scores alone?

No. Benchmark scores are useful for shortlisting, but production choice should also consider coding reliability, context window, price, privacy, rate limits, tool integrations, and the workflow you need to automate.

Why are AI agent workflow scores included beside model benchmarks?

Software teams increasingly choose full agent stacks, not only base models. Setup time, memory, sandboxing, ecosystem quality, and security posture affect whether an AI stack is actually usable in production.

How often is this AI benchmark page updated?

The benchmark page is reviewed frequently, with priority on frontier model launches, open-weight model releases, coding benchmark updates, and agent framework changes that affect developers.