Curated Snapshot · May 18, 2026

AI model benchmarks: ChatGPT vs Claude vs Gemini vs Grok.

Compare 62 frontier and open-weight AI models across benchmarks, pricing, and context windows, then evaluate open-source agent stacks like OpenClaw, Hermes Agent, OpenCode, Cline, and DeerFlow on practical workflow fit. Watchlist rows are separated from verified scores so new model listings do not imply fake rankings.

62 Models (47 scored · 15 watchlist) · 8 Agent scores · 42 Proprietary · 20 Open

GPT-5.5 High · OpenAI · 400K context · score 98.1

Claude Opus 4.7 · Anthropic · 1M context · score 94.2

Gemini 3.1 Pro · Google · 1M context · score 94.2

Best Overall: GPT-5.5 High (OpenAI), 98.1 overall

Best Coding: Claude Opus 4.7 (Anthropic), 96.4% HumanEval

Best Open Source: Kimi K2.6 (Moonshot AI), 88.5 overall

Best Value: Muse Spark

Open-source agent workflow leaderboard

This NeuralStackly score combines public GitHub signals, docs, setup ergonomics, autonomy, memory, security posture, ecosystem depth, and developer workflow fit. It is a practical adoption score, not a vendor-reported model benchmark.

8 agents scored · 954k repo stars tracked · 91.0 top score

#1 OpenClaw · personal agent · 369k GitHub stars · score 91.0

Always-on personal agents, multi-channel automation, skills, and OpenClaw ecosystem searches.

#2 Hermes Agent · personal agent · 135k GitHub stars · score 90.5

Self-improving long-running agents with persistent memory, cron, skills, and provider routing.

#3 OpenCode · coding agent · 156k GitHub stars · score 88.0

Terminal-native coding agent workflows with provider control and open-source ergonomics.

#4 Cline · coding agent · 61k GitHub stars · score 84.5

IDE-based coding agents that ask permission before editing files or executing commands.

Rank	Agent	Stars	Setup	Autonomy	Memory	Security	Ecosystem	Score
1	OpenClaw	369k	72	94	82	84	99	91.0
2	Hermes Agent	135k	78	92	95	86	88	90.5
3	OpenCode	156k	86	88	76	84	90	88.0
4	Cline	61k	84	84	72	86	82	84.5
5	DeerFlow	65k	68	90	84	90	82	83.5
6	Goose	44k	82	84	76	84	78	82.0
7	Browser Use	92k	86	78	58	78	88	79.0
8	LangGraph	31k	76	82	78	82	84	78.5
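The published Score is NeuralStackly's own composite, and its weights are not listed on this page. For a team whose priorities differ, the sub-scores in the table can be re-ranked with custom weights. The following is a minimal sketch of that idea; the function name `rank_agents` and the example weights are illustrative assumptions, not part of the leaderboard's method.

```python
# Sub-scores copied from the table above, in the order:
# (setup, autonomy, memory, security, ecosystem).
AGENTS = {
    "OpenClaw": (72, 94, 82, 84, 99),
    "Hermes Agent": (78, 92, 95, 86, 88),
    "OpenCode": (86, 88, 76, 84, 90),
    "Cline": (84, 84, 72, 86, 82),
    "DeerFlow": (68, 90, 84, 90, 82),
    "Goose": (82, 84, 76, 84, 78),
    "Browser Use": (86, 78, 58, 78, 88),
    "LangGraph": (76, 82, 78, 82, 84),
}
CRITERIA = ("setup", "autonomy", "memory", "security", "ecosystem")


def rank_agents(weights):
    """Return (agent, weighted score) pairs, best first.

    `weights` maps a criterion name to a non-negative weight;
    criteria that are left out count as weight 0.
    """
    total = sum(weights.get(c, 0.0) for c in CRITERIA)
    if total <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    ranked = []
    for name, scores in AGENTS.items():
        value = sum(weights.get(c, 0.0) * s for c, s in zip(CRITERIA, scores)) / total
        ranked.append((name, round(value, 1)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical priorities for a security-sensitive team (illustrative only).
security_first = {"security": 0.5, "memory": 0.3, "setup": 0.2}
for name, value in rank_agents(security_first)[:3]:
    print(name, value)
```

Because the weights are normalized inside the function, they can be given on any scale, for example 5/3/2 instead of 0.5/0.3/0.2.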

What OpenClaw owns

Search demand, ecosystem gravity, chat channels, skills, and broad personal-agent automation.

What Hermes owns

Memory, scheduled work, self-improvement, terminal workflow, and multi-provider routing.

What to test next

End-to-end agent tasks: setup time, secrets handling, sandbox boundaries, task completion, and rollback.

Curated Leaderboard

Ranks are unique for scored rows. Newly listed models show as watchlist until public benchmark data is available.

Rank	Change	Model	Company	MMLU	HumanEval	MATH	GPQA	Arena ELO (live)	Overall	Cost/1M	Context
1	NEW	GPT-5.5 High	OpenAI	89.1	89.7	93.2	81.5	1484	98.1	$2.49	400K
2	NEW	Claude Opus 4.7	Anthropic	95.6	96.4	98.2	91.0	1492	94.2	$15.00	1M
3	2	Gemini 3.1 Pro	Google	94.3	93.2	98.1	94.1	1493	94.2	$2.50	1M
4	3	GPT-5.4	OpenAI	93.8	96.3	97.9	92.8	1467	94.2	$2.50	1M
5	1	Claude Opus 4.6	Anthropic	94.2	95.1	97.6	90.5	1498	90.3	$15.00	1M
6	NEW	GPT-5.3 Codex	OpenAI	91.2	95.8	96.2	84.8	1300	88.5	$3.00	400K
7	NEW	Kimi K2.6 · 1T (MoE 32B active)	Moonshot AI	91.8	90.1	96.5	85.4	1529	88.5	$0.95	256K
8	NEW	Grok 4.20	xAI	91.2	91.5	95.1	79.4	1481	88.0	$3.00	256K
9	2	GPT-5	OpenAI	91.5	92.1	97.9	88.1	1475	87.1	$1.25	400K
10	2	Claude Opus 4.5	Anthropic	92.8	94.2	96.1	87.0	1469	86.2	$5.00	1M
11	—	Claude Sonnet 4.6	Anthropic	91.8	93.8	97.8	89.9	1468	84.6	$6.00	1M
12	NEW	DeepSeek V4 Pro	DeepSeek	—	—	—	—	1463	84.6	$0.43	1M
13	NEW	Muse Spark	Meta	91.8	92.1	95.4	80.7	1490	84.6	$0.00	262K
14	NEW	Qwen3.6 Max	Alibaba	90.5	90.8	94.6	77.2	1362	84.6	$1.60	128K
15	2	Claude Sonnet 4.5	Anthropic	89.1	90.8	97.7	83.4	1453	83.8	$3.00	1M
16	5	GLM-5.1	Zhipu AI	89.7	88.1	94.5	86.2	1534	82.7	$2.15	200K
17	1	GPT-5.2	OpenAI	92.1	94.8	98.0	91.4	1436	81.5	$1.75	400K
18	1	GLM-5	Zhipu AI	91.2	90.5	95.3	77.8	1470	80.8	$1.00	200K
19	NEW	MiniMax M2.7	MiniMax	89.2	86.8	93.5	82.1	1250	80.8	$0.32	205K
20	3	DeepSeek V3 · 671B (MoE 37B active)	DeepSeek	85.1	84.2	88.5	72.4	1424	79.4	$0.27	128K
21	2	Grok 4	xAI	88.0	87.5	93.2	81.8	1460	78.8	$3.00	2M
22	3	GPT-4.1	OpenAI	88.2	91.5	93.8	80.5	1413	77.7	$2.00	1M
23	NEW	DeepSeek V4 Flash	DeepSeek	—	—	—	—	1433	75.0	$0.14	1M
24	—	Gemini 3 Flash	Google	90.8	89.6	95.2	90.4	1473	73.1	$0.50	1M
25	NEW	Gemini 3 Flash Preview	Google	—	—	—	—	—	73.1	$0.50	1M
26	3	Claude 3.5 Sonnet	Anthropic	82.1	85.8	84.1	65.3	1372	71.5	$3.00	200K
27	2	GPT-4o	OpenAI	83.5	86.5	85.2	68.4	1345	67.4	$2.50	128K
28	NEW	DeepSeek V3.2 · 685B (37B active, MoE)	DeepSeek	89.5	89.3	93.2	75.8	1423	65.4	$0.28	128K
29	1	Gemini 3 Pro	Google	93.5	92.0	98.5	91.9	1486	63.5	$2.00	1M
30	3	Gemma 4 31B · 31B	Google	80.5	79.8	83.8	68.5	950	59.6	$0.10	128K
31	NEW	OpenAI o3-mini	OpenAI	—	—	—	—	—	57.7	$1.10	200K
32	4	Kimi K2.5	Moonshot AI	90.4	89.3	96.8	87.6	1432	55.8	$0.60	262K
33	NEW	Nemotron 3 Super · 120B (12B active)	NVIDIA	85.3	84.2	88.5	68.9	1361	53.8	$0.20	128K
34	4	Gemini 2.5 Pro	Google	87.8	88.4	92.3	79.8	1447	51.9	$1.25	1M
35	NEW	gpt-oss-120B · 120B	OpenAI	83.1	81.6	86.2	65.4	1220	48.1	$0.15	128K
36	5	Grok 3	xAI	84.5	83.5	87.8	72.1	1362	46.2	$3.00	131K
37	—	Qwen 3.5 122B · 122B (MoE A10B)	Alibaba	84.2	84.8	90.1	86.6	1134	35.5	$0.40	262K
38	NEW	GPT-5.4 mini	OpenAI	87.2	88.5	91.5	79.8	1456	28.8	$1.69	400K
39	4	Mistral Large	Mistral AI	81.2	80.1	82.5	62.1	1415	28.8	$2.00	128K
40	2	Qwen 3.5 397B · 397B (MoE A17B)	Alibaba	89.5	88.9	95.6	88.4	1067	25.3	$0.60	262K
41	3	Command R+ · 104B	Cohere	75.2	75.0	72.8	55.2	1060	24.2	$2.50	128K
42	4	Mixtral 8x22B · 141B (MoE 39B active)	Mistral AI	72.8	71.5	68.2	48.5	1030	19.7	$0.65	65K
43	1	Llama 4 Maverick · 400B (MoE 17B active)	Meta	87.2	86.1	91.5	78.2	1327	19.2	$0.20	10M
44	2	MiniMax M2.5	MiniMax	83.8	83.5	89.2	74.8	1020	18.2	$0.20	1M
45	6	Llama 3.1 405B · 405B	Meta	79.8	79.2	78.5	60.8	1333	17.3	$3.00	128K
46	—	Llama 4 Scout · 109B (MoE 17B active)	Meta	82.5	82.3	86.1	70.5	1322	11.5	$0.11	10M
47	2	Phi-4 · 14B	Microsoft	78.5	82.5	80.8	61.5	1080	3.8	$0.07	16K
Watch	NEW	Devstral 2512	Mistral	—	—	—	—	—	pending	$0.40	256K
Watch	NEW	GPT-5.5	OpenAI	—	—	—	—	—	pending	$5.00	1.05M
Watch	NEW	GPT-5.5 Pro	OpenAI	—	—	—	—	—	pending	$30.00	1.05M
Watch	NEW	Grok 4.3	xAI	—	—	—	—	—	pending	$1.25	1M
Watch	NEW	MiniMax 01	MiniMax	—	—	—	—	—	pending	$0.20	1M
Watch	NEW	Mistral Large 2512	Mistral	—	—	—	—	—	pending	$0.50	256K
Watch	NEW	Mistral Medium 3.5	Mistral AI	—	—	—	—	—	pending	$1.50	262K
Watch	NEW	OpenAI o3	OpenAI	—	—	—	—	—	pending	$2.00	200K
Watch	NEW	OpenAI o3-pro	OpenAI	—	—	—	—	—	pending	$20.00	200K
Watch	NEW	OpenAI o4-mini	OpenAI	—	—	—	—	—	pending	$1.10	200K
Watch	NEW	Qwen3.6 Flash	Alibaba	—	—	—	—	—	pending	$0.25	1M
Watch	NEW	Qwen3.6 Max Preview	Alibaba	—	—	—	—	—	pending	$1.04	262K
Watch	NEW	Qwen3.6 Plus	Alibaba	—	—	—	—	—	pending	$0.33	1M
Watch	NEW	Qwen3 Coder Plus	Qwen	—	—	—	—	—	pending	$0.65	1M
Watch	NEW	Qwen3 Max	Qwen	—	—	—	—	—	pending	$0.78	256K

62 models · Sorted by Rank (ascending)

Compare Models

Select up to 3 models to see side-by-side benchmark comparisons.

Benchmark	GPT-5.5 High	Claude Opus 4.7	Gemini 3.1 Pro
MMLU	89.1	95.6	94.3
MMLU-Pro	82.3	92.3	91.0
HumanEval	89.7	96.4	93.2
MATH	93.2	98.2	98.1
GPQA Diamond	81.5	91.0	94.1
SimpleQA	36.2	45.1	44.2

Metric	GPT-5.5 High	Claude Opus 4.7	Gemini 3.1 Pro
Overall Score	98.1	94.2	94.2
Cost/1M In	$2.49	$15.00	$2.50
Cost/1M Out	$9.98	$75.00	$15.00
Context Window	400K	1M	1M
Category	proprietary	proprietary	proprietary
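Per-million-token prices are easier to compare once they are applied to a concrete workload. The sketch below multiplies the input and output prices from the table above by a token volume; the function name `estimate_cost` and the example workload are hypothetical, and the estimate ignores caching, batch discounts, and any tiered pricing a provider may apply.

```python
# Prices per 1M tokens in USD as (input, output), copied from the table above.
PRICES = {
    "GPT-5.5 High": (2.49, 9.98),
    "Claude Opus 4.7": (15.00, 75.00),
    "Gemini 3.1 Pro": (2.50, 15.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of a workload at the listed per-1M-token prices."""
    price_in, price_out = PRICES[model]
    cost = input_tokens / 1_000_000 * price_in + output_tokens / 1_000_000 * price_out
    return round(cost, 2)


# Hypothetical monthly workload: 2M input tokens and 0.5M output tokens.
for model in PRICES:
    print(model, estimate_cost(model, 2_000_000, 500_000))
```

Output-heavy workloads shift the comparison, since output tokens are priced several times higher than input tokens for all three models.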

Methodology

Last updated: May 18, 2026

Our benchmark scores are compiled from independently run evaluations by Epoch AI, Scale AI, LMSYS Chatbot Arena, and official model technical reports. Where both exist, we prioritize independently verified scores over self-reported numbers. All scores represent the best reported result for each model on each benchmark.

MMLU

Massive Multitask Language Understanding. Tests knowledge across 57 subjects including STEM, humanities, and social sciences. 14,000+ multiple-choice questions.

MMLU-Pro

Harder version of MMLU with 10 choices per question (vs 4) and more complex reasoning required. Better at differentiating top models.

HumanEval

Coding benchmark with 164 Python programming challenges. Tests function implementation from docstrings. Reports pass@1 rate.
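For reference, pass@1 is the k = 1 case of the pass@k metric introduced with HumanEval, usually computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The sketch below implements that estimator; whether every score on this page was produced this way is not stated here, so treat it as an illustration of the metric only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass.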

MATH

Competition-style mathematics problems across algebra, geometry, number theory, and more. Level 5 (hardest) used for frontier model comparison.

GPQA Diamond

Graduate-level science questions in physics, chemistry, and biology. PhD experts validate questions; random guessing yields ~25%.

Arena ELO

Chatbot Arena rating from LMSYS. Real humans vote on model outputs in blind head-to-head comparisons. Measures conversational quality.
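A rating gap is easier to read as an expected win rate. The sketch below uses the standard Elo logistic curve with a 400-point scale; that Arena ratings follow exactly this curve is an assumption, since the Arena's own fitting procedure is not described on this page and ties are ignored here. The function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(rating_a, rating_b):
    """Probability that model A is preferred over model B under the
    standard Elo logistic curve (400-point scale, ties ignored)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the leaderboard: Gemini 3.1 Pro (1493) vs GPT-5.3 Codex (1300).
print(round(expected_win_rate(1493, 1300), 3))
```

Under this curve, gaps of a few points between top models correspond to win rates very close to 50%.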

SimpleQA

Factual accuracy benchmark testing whether models give correct, concise answers to straightforward questions. Measures hallucination rate.

How We Calculate Overall Score

The overall score is a weighted composite of normalized benchmark results:

• MMLU — 15% weight (breadth of knowledge)
• MMLU-Pro — 15% weight (harder reasoning)
• HumanEval — 20% weight (coding ability)
• MATH — 15% weight (mathematical reasoning)
• GPQA Diamond — 15% weight (scientific reasoning)
• Arena ELO — 10% weight (real-world preference)
• SimpleQA — 10% weight (factual accuracy)
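The weighting above can be sketched as a small function. The weights come from the list; everything else is an assumption made for illustration, because the normalization step is not specified on this page: percentage benchmarks are used as-is on a 0-100 scale, Arena ELO is min-max scaled over a hypothetical 900-1600 range, and weights are renormalized when a benchmark is missing. Results from this sketch should therefore not be expected to match the published Overall column.

```python
# Weights as listed above (they sum to 1.0).
WEIGHTS = {
    "MMLU": 0.15,
    "MMLU-Pro": 0.15,
    "HumanEval": 0.20,
    "MATH": 0.15,
    "GPQA Diamond": 0.15,
    "Arena ELO": 0.10,
    "SimpleQA": 0.10,
}

# Assumption: hypothetical bounds used to map Arena ELO onto 0-100.
ELO_MIN, ELO_MAX = 900.0, 1600.0


def normalize(benchmark, value):
    """Map a raw result onto 0-100 (only Arena ELO needs rescaling here)."""
    if benchmark == "Arena ELO":
        scaled = 100.0 * (value - ELO_MIN) / (ELO_MAX - ELO_MIN)
        return max(0.0, min(100.0, scaled))
    return value


def overall_score(results):
    """Weighted composite over the benchmarks present in `results`.

    Assumption: missing benchmarks are skipped and the remaining
    weights are renormalized so they still sum to 1.
    """
    available = {b: v for b, v in results.items() if b in WEIGHTS and v is not None}
    if not available:
        return None
    total_weight = sum(WEIGHTS[b] for b in available)
    weighted = sum(WEIGHTS[b] * normalize(b, v) for b, v in available.items())
    return round(weighted / total_weight, 1)


# Claude Opus 4.7 results as listed on this page.
claude_opus_47 = {
    "MMLU": 95.6, "MMLU-Pro": 92.3, "HumanEval": 96.4, "MATH": 98.2,
    "GPQA Diamond": 91.0, "Arena ELO": 1492, "SimpleQA": 45.1,
}
print(overall_score(claude_opus_47))
```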

Data Sources

Arena ELO: live · May 7, 2026

• Artificial Analysis / public benchmark pages where available
• LMSYS / Chatbot Arena where available
• OpenRouter model availability and pricing
• Official provider model cards
• LMSYS Arena (live)

About These Benchmarks

Benchmarks measure specific capabilities under controlled conditions and may not fully represent real-world performance. Different prompting strategies, system prompts, and tool use can significantly change results. We recommend trying models on your specific tasks using our tool finder or the AI tools directory.

AI Model Benchmarks: What You Need to Know in 2026

The AI landscape in 2026 is dominated by a handful of frontier models that push the boundaries of reasoning, coding, and scientific knowledge. Our LLM leaderboard tracks the latest benchmark scores from independent evaluations, giving you an objective way to compare models like GPT-5.5 High, Claude Opus 4.7, and Gemini 3.1 Pro.

Whether you're choosing a model for coding assistance, research, or general chatbot use, benchmarks like MMLU (knowledge), HumanEval (coding), GPQA Diamond (science), and Chatbot Arena ELO (human preference) provide concrete data points for comparison.

Key Takeaways for May 2026

• GPT-5.5 High is the top-scored row in this snapshot at 98.1; Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.4 follow at 94.2 each.
• New model listings like GPT-5.5, Grok 4.3, Mistral Medium 3.5, and several Qwen3.6 variants are included as watchlist rows until public benchmark data is available.
• Coding comparisons should use HumanEval and model-specific coding evals alongside price and context window; raw overall rank is not enough.
• Agent workflow scores now belong beside model benchmarks because production fit depends on setup time, memory, security, and ecosystem quality.
• Open-weight models remain strongest when privacy, local deployment, and cost control matter more than absolute frontier performance.
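The point about pairing HumanEval with price and context window can be made concrete with a simple filter. The sketch below shortlists models that meet all three constraints; the rows are copied from the leaderboard, while the `Model` class, the `shortlist` function, the thresholds, and the reading of "K" as 1,000 tokens are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    humaneval: float     # HumanEval score, percent
    cost_per_1m: float   # USD per 1M tokens, as listed in the leaderboard
    context_tokens: int  # context window, assuming 1K = 1,000 tokens


# A few rows copied from the leaderboard.
MODELS = [
    Model("GPT-5.5 High", 89.7, 2.49, 400_000),
    Model("Claude Opus 4.7", 96.4, 15.00, 1_000_000),
    Model("Gemini 3.1 Pro", 93.2, 2.50, 1_000_000),
    Model("GPT-5.4", 96.3, 2.50, 1_000_000),
    Model("Kimi K2.6", 90.1, 0.95, 256_000),
]


def shortlist(models, min_humaneval, max_cost, min_context):
    """Keep models meeting all three constraints, best HumanEval first."""
    kept = [
        m for m in models
        if m.humaneval >= min_humaneval
        and m.cost_per_1m <= max_cost
        and m.context_tokens >= min_context
    ]
    return sorted(kept, key=lambda m: m.humaneval, reverse=True)


# Hypothetical requirements: HumanEval >= 93, at most $5 per 1M tokens, >= 500K context.
for model in shortlist(MODELS, min_humaneval=93.0, max_cost=5.00, min_context=500_000):
    print(model.name, model.humaneval, model.cost_per_1m)
```

A filter like this only narrows the field; the remaining candidates still need to be tried on the actual coding tasks.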

Fast shortlists by use case

Best AI model for coding

Pair HumanEval and coding-agent workflow scores with IDE fit.

Best LLM API provider

Compare hosted API price, latency, context, and routing options.

Best AI agent platform

Evaluate memory, autonomy, setup time, and security posture.

Open-source AI tools

Review repos when self-hosting and privacy matter most.

Explore our full AI tools directory for detailed reviews of each model, or use the AI Tool Finder to discover which model best fits your needs.


AI benchmark FAQ

Which AI model is best overall in this benchmark snapshot?

The top scored row changes as public benchmark data updates. NeuralStackly separates verified benchmark rows from watchlist models so newly listed models do not receive fake rankings before public data is available.

Should I choose ChatGPT, Claude, Gemini, or Grok from benchmark scores alone?

No. Benchmark scores are useful for shortlisting, but production choice should also consider coding reliability, context window, price, privacy, rate limits, tool integrations, and the workflow you need to automate.

Why are AI agent workflow scores included beside model benchmarks?

Software teams increasingly choose full agent stacks, not only base models. Setup time, memory, sandboxing, ecosystem quality, and security posture affect whether an AI stack is actually usable in production.

How often is this AI benchmark page updated?

The benchmark page is reviewed frequently, with priority on frontier model launches, open-weight model releases, coding benchmark updates, and agent framework changes that affect developers.