AI model benchmarks: ChatGPT vs Claude vs Gemini vs Grok.
Compare 62 frontier and open-weight AI models across benchmarks, pricing, and context windows, then evaluate open-source agent stacks like OpenClaw, Hermes Agent, OpenCode, Cline, and DeerFlow on practical workflow fit. Watchlist rows are separated from verified scores so new model listings do not imply fake rankings.
Open-source agent workflow leaderboard
This NeuralStackly score combines public GitHub signals, docs, setup ergonomics, autonomy, memory, security posture, ecosystem depth, and developer workflow fit. It is a practical adoption score, not a vendor-reported model benchmark.
8 agents scored · 954k repo stars tracked · 91.0 top score
369k GitHub stars
Always-on personal agents, multi-channel automation, skills, and OpenClaw ecosystem searches.
135k GitHub stars
Self-improving long-running agents with persistent memory, cron, skills, and provider routing.
156k GitHub stars
Terminal-native coding agent workflows with provider control and open-source ergonomics.
61k GitHub stars
IDE-based coding agents that ask permission before editing files or executing commands.
| Agent | Stars | Setup | Autonomy | Memory | Security | Ecosystem | Score |
|---|---|---|---|---|---|---|---|
| | 369k | 72 | 94 | 82 | 84 | 99 | 91.0 |
| | 135k | 78 | 92 | 95 | 86 | 88 | 90.5 |
| | 156k | 86 | 88 | 76 | 84 | 90 | 88.0 |
| | 61k | 84 | 84 | 72 | 86 | 82 | 84.5 |
| | 65k | 68 | 90 | 84 | 90 | 82 | 83.5 |
| | 44k | 82 | 84 | 76 | 84 | 78 | 82.0 |
| | 92k | 86 | 78 | 58 | 78 | 88 | 79.0 |
| | 31k | 76 | 82 | 78 | 82 | 84 | 78.5 |
What OpenClaw owns
Search demand, ecosystem gravity, chat channels, skills, and broad personal-agent automation.
What Hermes owns
Memory, scheduled work, self-improvement, terminal workflow, and multi-provider routing.
What to test next
End-to-end agent tasks: setup time, secrets handling, sandbox boundaries, task completion, and rollback.
Curated Leaderboard
Ranks are unique for scored rows. Newly listed models show as watchlist until public benchmark data is available.
| Rank | Change | Model | Params | Company | MMLU | HumanEval | MATH | GPQA | Arena ELO (Live) | Overall | Cost/1M | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NEW | GPT-5.5 High | | OpenAI | 89.1 | 89.7 | 93.2 | 81.5 | 1484 | 98.1 | $2.49 | 400K |
| 2 | NEW | Claude Opus 4.7 | | Anthropic | 95.6 | 96.4 | 98.2 | 91.0 | 1492 | 94.2 | $15.00 | 1M |
| 3 | 2 | Gemini 3.1 Pro | | Google | 94.3 | 93.2 | 98.1 | 94.1 | 1493 | 94.2 | $2.50 | 1M |
| 4 | 3 | GPT-5.4 | | OpenAI | 93.8 | 96.3 | 97.9 | 92.8 | 1467 | 94.2 | $2.50 | 1M |
| 5 | 1 | Claude Opus 4.6 | | Anthropic | 94.2 | 95.1 | 97.6 | 90.5 | 1498 | 90.3 | $15.00 | 1M |
| 6 | NEW | GPT-5.3 Codex | | OpenAI | 91.2 | 95.8 | 96.2 | 84.8 | 1300 | 88.5 | $3.00 | 400K |
| 7 | NEW | Kimi K2.6 | 1T (MoE 32B active) | Moonshot AI | 91.8 | 90.1 | 96.5 | 85.4 | 1529 | 88.5 | $0.95 | 256K |
| 8 | NEW | Grok 4.20 | | xAI | 91.2 | 91.5 | 95.1 | 79.4 | 1481 | 88.0 | $3.00 | 256K |
| 9 | 2 | GPT-5 | | OpenAI | 91.5 | 92.1 | 97.9 | 88.1 | 1475 | 87.1 | $1.25 | 400K |
| 10 | 2 | Claude Opus 4.5 | | Anthropic | 92.8 | 94.2 | 96.1 | 87.0 | 1469 | 86.2 | $5.00 | 1M |
| 11 | | Claude Sonnet 4.6 | | Anthropic | 91.8 | 93.8 | 97.8 | 89.9 | 1468 | 84.6 | $6.00 | 1M |
| 12 | NEW | DeepSeek V4 Pro | | DeepSeek | n/a | n/a | n/a | n/a | 1463 | 84.6 | $0.43 | 1M |
| 13 | NEW | Muse Spark | | Meta | 91.8 | 92.1 | 95.4 | 80.7 | 1490 | 84.6 | $0.00 | 262K |
| 14 | NEW | Qwen3.6 Max | | Alibaba | 90.5 | 90.8 | 94.6 | 77.2 | 1362 | 84.6 | $1.60 | 128K |
| 15 | 2 | Claude Sonnet 4.5 | | Anthropic | 89.1 | 90.8 | 97.7 | 83.4 | 1453 | 83.8 | $3.00 | 1M |
| 16 | 5 | GLM-5.1 | ? | Zhipu AI | 89.7 | 88.1 | 94.5 | 86.2 | 1534 | 82.7 | $2.15 | 200K |
| 17 | 1 | GPT-5.2 | | OpenAI | 92.1 | 94.8 | 98.0 | 91.4 | 1436 | 81.5 | $1.75 | 400K |
| 18 | 1 | GLM-5 | ? | Zhipu AI | 91.2 | 90.5 | 95.3 | 77.8 | 1470 | 80.8 | $1.00 | 200K |
| 19 | NEW | MiniMax M2.7 | ? | MiniMax | 89.2 | 86.8 | 93.5 | 82.1 | 1250 | 80.8 | $0.32 | 205K |
| 20 | 3 | DeepSeek V3 | 671B (MoE 37B active) | DeepSeek | 85.1 | 84.2 | 88.5 | 72.4 | 1424 | 79.4 | $0.27 | 128K |
| 21 | 2 | Grok 4 | | xAI | 88.0 | 87.5 | 93.2 | 81.8 | 1460 | 78.8 | $3.00 | 2M |
| 22 | 3 | GPT-4.1 | | OpenAI | 88.2 | 91.5 | 93.8 | 80.5 | 1413 | 77.7 | $2.00 | 1M |
| 23 | NEW | DeepSeek V4 Flash | | DeepSeek | n/a | n/a | n/a | n/a | 1433 | 75.0 | $0.14 | 1M |
| 24 | | Gemini 3 Flash | | Google | 90.8 | 89.6 | 95.2 | 90.4 | 1473 | 73.1 | $0.50 | 1M |
| 25 | NEW | Gemini 3 Flash Preview | | Google | n/a | n/a | n/a | n/a | n/a | 73.1 | $0.50 | 1M |
| 26 | 3 | Claude 3.5 Sonnet | | Anthropic | 82.1 | 85.8 | 84.1 | 65.3 | 1372 | 71.5 | $3.00 | 200K |
| 27 | 2 | GPT-4o | | OpenAI | 83.5 | 86.5 | 85.2 | 68.4 | 1345 | 67.4 | $2.50 | 128K |
| 28 | NEW | DeepSeek V3.2 | 685B (37B active, MoE) | DeepSeek | 89.5 | 89.3 | 93.2 | 75.8 | 1423 | 65.4 | $0.28 | 128K |
| 29 | 1 | Gemini 3 Pro | | Google | 93.5 | 92.0 | 98.5 | 91.9 | 1486 | 63.5 | $2.00 | 1M |
| 30 | 3 | Gemma 4 31B | 31B | Google | 80.5 | 79.8 | 83.8 | 68.5 | 950 | 59.6 | $0.10 | 128K |
| 31 | NEW | OpenAI o3-mini | | OpenAI | n/a | n/a | n/a | n/a | n/a | 57.7 | $1.10 | 200K |
| 32 | 4 | Kimi K2.5 | ? | Moonshot AI | 90.4 | 89.3 | 96.8 | 87.6 | 1432 | 55.8 | $0.60 | 262K |
| 33 | NEW | Nemotron 3 Super | 120B (12B active) | NVIDIA | 85.3 | 84.2 | 88.5 | 68.9 | 1361 | 53.8 | $0.20 | 128K |
| 34 | 4 | Gemini 2.5 Pro | | Google | 87.8 | 88.4 | 92.3 | 79.8 | 1447 | 51.9 | $1.25 | 1M |
| 35 | NEW | gpt-oss-120B | 120B | OpenAI | 83.1 | 81.6 | 86.2 | 65.4 | 1220 | 48.1 | $0.15 | 128K |
| 36 | 5 | Grok 3 | | xAI | 84.5 | 83.5 | 87.8 | 72.1 | 1362 | 46.2 | $3.00 | 131K |
| 37 | | Qwen 3.5 122B | 122B (MoE A10B) | Alibaba | 84.2 | 84.8 | 90.1 | 86.6 | 1134 | 35.5 | $0.40 | 262K |
| 38 | NEW | GPT-5.4 mini | | OpenAI | 87.2 | 88.5 | 91.5 | 79.8 | 1456 | 28.8 | $1.69 | 400K |
| 39 | 4 | Mistral Large | | Mistral AI | 81.2 | 80.1 | 82.5 | 62.1 | 1415 | 28.8 | $2.00 | 128K |
| 40 | 2 | Qwen 3.5 397B | 397B (MoE A17B) | Alibaba | 89.5 | 88.9 | 95.6 | 88.4 | 1067 | 25.3 | $0.60 | 262K |
| 41 | 3 | Command R+ | 104B | Cohere | 75.2 | 75.0 | 72.8 | 55.2 | 1060 | 24.2 | $2.50 | 128K |
| 42 | 4 | Mixtral 8x22B | 141B (MoE 39B active) | Mistral AI | 72.8 | 71.5 | 68.2 | 48.5 | 1030 | 19.7 | $0.65 | 65K |
| 43 | 1 | Llama 4 Maverick | 400B (MoE 17B active) | Meta | 87.2 | 86.1 | 91.5 | 78.2 | 1327 | 19.2 | $0.20 | 10M |
| 44 | 2 | MiniMax M2.5 | ? | MiniMax | 83.8 | 83.5 | 89.2 | 74.8 | 1020 | 18.2 | $0.20 | 1M |
| 45 | 6 | Llama 3.1 405B | 405B | Meta | 79.8 | 79.2 | 78.5 | 60.8 | 1333 | 17.3 | $3.00 | 128K |
| 46 | | Llama 4 Scout | 109B (MoE 17B active) | Meta | 82.5 | 82.3 | 86.1 | 70.5 | 1322 | 11.5 | $0.11 | 10M |
| 47 | 2 | Phi-4 | 14B | Microsoft | 78.5 | 82.5 | 80.8 | 61.5 | 1080 | 3.8 | $0.07 | 16K |
| Watch | NEW | Devstral 2512 (benchmark pending) | | Mistral | n/a | n/a | n/a | n/a | n/a | pending | $0.40 | 256K |
| Watch | NEW | GPT-5.5 (benchmark pending) | | OpenAI | n/a | n/a | n/a | n/a | n/a | pending | $5.00 | 1.05M |
| Watch | NEW | GPT-5.5 Pro (benchmark pending) | | OpenAI | n/a | n/a | n/a | n/a | n/a | pending | $30.00 | 1.05M |
| Watch | NEW | Grok 4.3 (benchmark pending) | | xAI | n/a | n/a | n/a | n/a | n/a | pending | $1.25 | 1M |
| Watch | NEW | MiniMax 01 (benchmark pending) | | MiniMax | n/a | n/a | n/a | n/a | n/a | pending | $0.20 | 1M |
| Watch | NEW | Mistral Large 2512 (benchmark pending) | | Mistral | n/a | n/a | n/a | n/a | n/a | pending | $0.50 | 256K |
| Watch | NEW | Mistral Medium 3.5 (benchmark pending) | | Mistral AI | n/a | n/a | n/a | n/a | n/a | pending | $1.50 | 262K |
| Watch | NEW | OpenAI o3 (benchmark pending) | | OpenAI | n/a | n/a | n/a | n/a | n/a | pending | $2.00 | 200K |
| Watch | NEW | OpenAI o3-pro (benchmark pending) | | OpenAI | n/a | n/a | n/a | n/a | n/a | pending | $20.00 | 200K |
| Watch | NEW | OpenAI o4-mini (benchmark pending) | | OpenAI | n/a | n/a | n/a | n/a | n/a | pending | $1.10 | 200K |
| Watch | NEW | Qwen3.6 Flash (benchmark pending) | | Alibaba | n/a | n/a | n/a | n/a | n/a | pending | $0.25 | 1M |
| Watch | NEW | Qwen3.6 Max Preview (benchmark pending) | | Alibaba | n/a | n/a | n/a | n/a | n/a | pending | $1.04 | 262K |
| Watch | NEW | Qwen3.6 Plus (benchmark pending) | | Alibaba | n/a | n/a | n/a | n/a | n/a | pending | $0.33 | 1M |
| Watch | NEW | Qwen3 Coder Plus (benchmark pending) | | Qwen | n/a | n/a | n/a | n/a | n/a | pending | $0.65 | 1M |
| Watch | NEW | Qwen3 Max (benchmark pending) | | Qwen | n/a | n/a | n/a | n/a | n/a | pending | $0.78 | 256K |
62 models · Sorted by Rank (ascending)
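The ranking rule stated above the table (unique ranks for scored rows, watchlist rows kept apart) can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the `Row` type, the `assign_ranks` helper, and the alphabetical tie-break are all assumptions, since the page does not say how ties on the overall score are resolved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    model: str
    overall: Optional[float]  # None means no public benchmark data yet


def assign_ranks(rows: list[Row]) -> list[tuple[str, Row]]:
    """Give scored rows unique ranks 1..N and label the rest "Watch"."""
    scored = [r for r in rows if r.overall is not None]
    watch = [r for r in rows if r.overall is None]
    # Assumption: ties on the overall score are broken alphabetically by name.
    scored.sort(key=lambda r: (-r.overall, r.model))
    watch.sort(key=lambda r: r.model)
    ranked = [(str(i), r) for i, r in enumerate(scored, start=1)]
    return ranked + [("Watch", r) for r in watch]


if __name__ == "__main__":
    # Overall scores copied from the table; None marks watchlist entries.
    sample = [
        Row("Gemini 3.1 Pro", 94.2),
        Row("GPT-5.5", None),
        Row("GPT-5.5 High", 98.1),
        Row("Grok 4.3", None),
        Row("Claude Opus 4.7", 94.2),
    ]
    for label, row in assign_ranks(sample):
        print(label, row.model, row.overall if row.overall is not None else "pending")
```

Keeping watchlist rows out of the sort entirely is what prevents an unscored model from inheriting a position it has not earned.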
Compare Models
Select up to 3 models to see side-by-side benchmark comparisons.
| Metric | GPT-5.5 High | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| Overall Score | 98.1 | 94.2 | 94.2 |
| Cost/1M In | $2.49 | $15.00 | $2.50 |
| Cost/1M Out | $9.98 | $75.00 | $15.00 |
| Context Window | 400K | 1M | 1M |
| Category | proprietary | proprietary | proprietary |
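The input and output prices in the comparison table can be turned into a per-request estimate. The sketch below assumes the prices are USD per 1M tokens; the `request_cost` helper and the token counts are illustrative and not part of the leaderboard itself.

```python
# Prices copied from the comparison table above, assumed to be USD per 1M tokens.
PRICES = {
    "GPT-5.5 High": {"in": 2.49, "out": 9.98},
    "Claude Opus 4.7": {"in": 15.00, "out": 75.00},
    "Gemini 3.1 Pro": {"in": 2.50, "out": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    price = PRICES[model]
    return (input_tokens * price["in"] + output_tokens * price["out"]) / 1_000_000


if __name__ == "__main__":
    # Illustrative workload: 10,000 input tokens and 2,000 output tokens.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 10_000, 2_000):.4f}")
```

Because output tokens are priced several times higher than input tokens for all three models, output-heavy workloads can shift the cost ordering more than the input price alone suggests.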
Methodology
Last updated: May 18, 2026
Our benchmark scores are compiled from independently run evaluations by Epoch AI, Scale AI, LMSYS Chatbot Arena, and official model technical reports. We prioritize independently verified scores over self-reported numbers where available. All scores represent the best reported result for each model on each benchmark.
MMLU
Massive Multitask Language Understanding. Tests knowledge across 57 subjects including STEM, humanities, and social sciences. 14,000+ multiple-choice questions.
MMLU-Pro
Harder version of MMLU with 10 choices per question (vs 4) and more complex reasoning required. Better at differentiating top models.
HumanEval
Coding benchmark with 164 Python programming challenges. Tests function implementation from docstrings. Reports pass@1 rate.
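For readers unfamiliar with pass@1, it is the k = 1 case of the pass@k metric. A minimal sketch of the unbiased pass@k estimator introduced alongside HumanEval is shown below; whether the sources behind this leaderboard sample several completions per problem or use a single attempt is not stated here, and the helper names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the plain fraction of correct samples, which is why pass@1 reads like an ordinary accuracy figure.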
MATH
Competition-style mathematics problems across algebra, geometry, number theory, and more. Level 5 (hardest) used for frontier model comparison.
GPQA Diamond
Graduate-level science questions in physics, chemistry, and biology. PhD experts validate questions; random guessing yields ~25%.
Arena ELO
Chatbot Arena rating from LMSYS. Real humans vote on model outputs in blind head-to-head comparisons. Measures conversational quality.
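To get a feel for what a rating gap means, the sketch below applies the standard Elo logistic curve with a 400-point scale. This is an assumption for illustration: the exact rating procedure used by Chatbot Arena may differ, and `expected_score` is a hypothetical helper.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability of A over B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Arena ELO values copied from the leaderboard table.
    opus_47, gpt_55_high = 1492, 1484
    print(f"{expected_score(opus_47, gpt_55_high):.3f}")
```

Under this model an 8-point gap corresponds to an expected preference rate only slightly above 50%, so small Arena differences between top rows should not be over-read.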
SimpleQA
Factual accuracy benchmark testing whether models give correct, concise answers to straightforward questions. Measures hallucination rate.
How We Calculate Overall Score
The overall score is a weighted composite of normalized benchmark results:
- MMLU: 15% weight (breadth of knowledge)
- MMLU-Pro: 15% weight (harder reasoning)
- HumanEval: 20% weight (coding ability)
- MATH: 15% weight (mathematical reasoning)
- GPQA Diamond: 15% weight (scientific reasoning)
- Arena ELO: 10% weight (real-world preference)
- SimpleQA: 10% weight (factual accuracy)
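The weighting above can be sketched as a small function. The weights are taken from the list; everything else is an assumption made for illustration, because the normalization method is not described on this page: percent-based benchmarks are used as-is, Arena ELO is rescaled with hypothetical bounds, and missing benchmarks are skipped with the remaining weights renormalized. The sketch is therefore not expected to reproduce the published Overall values.

```python
from typing import Mapping, Optional

# Weights copied from the list above; they sum to 1.0.
WEIGHTS = {
    "mmlu": 0.15,
    "mmlu_pro": 0.15,
    "humaneval": 0.20,
    "math": 0.15,
    "gpqa_diamond": 0.15,
    "arena_elo": 0.10,
    "simpleqa": 0.10,
}

# Hypothetical bounds for rescaling Arena ELO to 0-100; the page states none.
ELO_MIN, ELO_MAX = 900.0, 1600.0


def normalize(name: str, value: float) -> float:
    """Map a raw result onto 0-100. Percent benchmarks pass through unchanged."""
    if name == "arena_elo":
        scaled = (value - ELO_MIN) / (ELO_MAX - ELO_MIN) * 100.0
        return max(0.0, min(100.0, scaled))
    return value


def overall_score(results: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over available benchmarks, renormalizing weights for gaps."""
    total, weight_sum = 0.0, 0.0
    for name, weight in WEIGHTS.items():
        value = results.get(name)
        if value is None:
            continue  # Assumption: a missing benchmark is skipped, not scored 0.
        total += weight * normalize(name, value)
        weight_sum += weight
    return total / weight_sum if weight_sum else None
```

Renormalizing over the available benchmarks is one way a row with only an Arena ELO value could still receive a composite score; other choices, such as treating gaps as zero, would rank such rows very differently.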
About These Benchmarks
Benchmarks measure specific capabilities under controlled conditions and may not fully represent real-world performance. Different prompting strategies, system prompts, and tool use can significantly change results. We recommend trying models on your specific tasks using our tool finder or the AI tools directory.
AI Model Benchmarks: What You Need to Know in 2026
The AI landscape in 2026 is dominated by a handful of frontier models that push the boundaries of reasoning, coding, and scientific knowledge. Our LLM leaderboard tracks the latest benchmark scores from independent evaluations, giving you an objective way to compare models like GPT-5.5 High, Claude Opus 4.7, and Gemini 3.1 Pro.
Whether you're choosing a model for coding assistance, research, or general chatbot use, benchmarks like MMLU (knowledge), HumanEval (coding), GPQA Diamond (science), and Chatbot Arena ELO (human preference) provide concrete data points for comparison.
Key Takeaways for May 2026
- GPT-5.5 High remains the top scored row in this snapshot; Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.4 follow closely.
- New model listings like GPT-5.5, Grok 4.3, Mistral Medium 3.5, and Qwen3.6 Flash and Plus are included as watchlist rows until public benchmark data is available.
- Coding comparisons should use HumanEval and model-specific coding evals alongside price and context window — raw overall rank is not enough.
- Agent workflow scores now belong beside model benchmarks because production fit depends on setup time, memory, security, and ecosystem quality.
- Open-weight models remain strongest when privacy, local deployment, and cost control matter more than absolute frontier performance.
Fast shortlists by use case
Best AI model for coding
Pair HumanEval and coding-agent workflow scores with IDE fit.
Best LLM API provider
Compare hosted API price, latency, context, and routing options.
Best AI agent platform
Evaluate memory, autonomy, setup time, and security posture.
Open-source AI tools
Review repos when self-hosting and privacy matter most.
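As one way to build a coding shortlist from the leaderboard, the sketch below filters a few rows by price and context window and then sorts by HumanEval. The `Model` type, the `coding_shortlist` helper, and the thresholds are illustrative assumptions; the figures are copied from the leaderboard table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    humaneval: float      # pass@1, percent
    cost_per_1m: float    # USD, as listed in the Cost/1M column
    context_tokens: int   # context window in tokens


# A few rows copied from the leaderboard table.
MODELS = [
    Model("Claude Opus 4.7", 96.4, 15.00, 1_000_000),
    Model("GPT-5.4", 96.3, 2.50, 1_000_000),
    Model("GPT-5.3 Codex", 95.8, 3.00, 400_000),
    Model("Kimi K2.6", 90.1, 0.95, 256_000),
]


def coding_shortlist(models, max_cost, min_context):
    """Keep models within budget and context needs, best HumanEval first."""
    eligible = [
        m for m in models
        if m.cost_per_1m <= max_cost and m.context_tokens >= min_context
    ]
    return sorted(eligible, key=lambda m: m.humaneval, reverse=True)


if __name__ == "__main__":
    # Hypothetical requirement: at most $3.00 per 1M tokens, at least 400K context.
    for m in coding_shortlist(MODELS, max_cost=3.00, min_context=400_000):
        print(m.name, m.humaneval)
```

Filtering on hard constraints first and ranking second keeps a high benchmark score from hiding a price or context limit that rules a model out.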
Explore our full AI tools directory for detailed reviews of each model, or use the AI Tool Finder to discover which model best fits your needs.
AI benchmark FAQ
Which AI model is best overall in this benchmark snapshot?
The top scored row changes as public benchmark data updates. NeuralStackly separates verified benchmark rows from watchlist models so newly listed models do not receive fake rankings before public data is available.
Should I choose ChatGPT, Claude, Gemini, or Grok from benchmark scores alone?
No. Benchmark scores are useful for shortlisting, but production choice should also consider coding reliability, context window, price, privacy, rate limits, tool integrations, and the workflow you need to automate.
Why are AI agent workflow scores included beside model benchmarks?
Software teams increasingly choose full agent stacks, not only base models. Setup time, memory, sandboxing, ecosystem quality, and security posture affect whether an AI stack is actually usable in production.
How often is this AI benchmark page updated?
The benchmark page is reviewed frequently, with priority on frontier model launches, open-weight model releases, coding benchmark updates, and agent framework changes that affect developers.