AI Model Benchmarks
Daily head-to-head evaluations of frontier AI models across coding, reasoning, and tool-use tasks.
Category weights: Coding 40%, Reasoning 35%, Tool-use 25%.
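Under these weights, an overall score is a weighted average of the three category scores. A minimal sketch, assuming each category is scored on a 0-10 scale; the scorecard's exact aggregation formula is not documented here:

```python
# Category weights from the scorecard header (assumed to apply to 0-10 scores).
WEIGHTS = {"coding": 0.40, "reasoning": 0.35, "tool_use": 0.25}

def overall(coding: float, reasoning: float, tool_use: float) -> float:
    """Weighted average of the three category scores, rounded to 2 decimals."""
    scores = {"coding": coding, "reasoning": reasoning, "tool_use": tool_use}
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)

print(overall(8.7, 9.0, 8.5))  # weighted blend of three 0-10 category scores
```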
Latest
Daily Model Eval Scorecard — 2026-03-22
Featured: MiniMax 2.7, whose three run totals of 8.70, 9.00, and 8.50 average 8.73.

| Model | Coding | Reasoning | Tool-use | Total |
|---|---|---|---|---|
| Claude Opus 4.6 | 4.0 | 2.6 | 2.7 | 9.30 |
| GPT-5.4 XHigh | 3.8 | 2.4 | 2.8 | 9.00 |
| MiniMax 2.7 | 3.7 | 2.6 | 2.4 | 8.70 |
| GLM-5 Turbo | 3.6 | 2.5 | 2.5 | 8.60 |
| GPT-5.4 XHigh | 3.9 | 2.5 | 3.1 | 9.50 |
| MiniMax 2.7 | 3.6 | 2.8 | 2.6 | 9.00 |
| Claude Opus 4.6 | 3.7 | 2.5 | 2.9 | 9.10 |
| GLM-5 Turbo | 3.5 | 2.5 | 2.8 | 8.80 |
| GLM-5 Turbo | 3.8 | 2.9 | 2.7 | 9.40 |
| Claude Opus 4.6 | 3.7 | 2.6 | 2.6 | 8.90 |
| GPT-5.4 XHigh | 3.7 | 2.3 | 2.7 | 8.70 |
| MiniMax 2.7 | 3.4 | 2.6 | 2.5 | 8.50 |
Head-to-head results across coding, reasoning, and tool-use tasks. Today: Claude Opus 4.6, GPT-5.4 XHigh, GLM-5 Turbo, and MiniMax 2.7.
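The per-model averages can be reproduced by averaging each model's run totals. A sketch using the Total column from the table above; the sort order and rounding are assumptions:

```python
from statistics import mean

# Run totals per model, taken from the Total column of today's scorecard.
run_totals = {
    "MiniMax 2.7":     [8.70, 9.00, 8.50],
    "Claude Opus 4.6": [9.30, 9.10, 8.90],
    "GPT-5.4 XHigh":   [9.00, 9.50, 8.70],
    "GLM-5 Turbo":     [8.60, 8.80, 9.40],
}

# Average the three runs for each model, rounded to two decimals.
averages = {model: round(mean(totals), 2) for model, totals in run_totals.items()}

# Print models ranked by average, best first.
for model, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {avg}")
```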
Methodology
- Correctness (4 pts): Does the model actually solve the problem correctly?
- Speed (3 pts): How fast did it get there? Time-to-first-token plus total latency.
- Clarity (3 pts): Is the output clean, well-structured, and free of hallucination?
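The rubric components sum to a 10-point task score. A minimal sketch; clamping each component to its stated maximum is an assumption, not a documented rule of the scorecard:

```python
# Rubric maxima from the methodology section: 4 + 3 + 3 = 10 points.
RUBRIC_MAX = {"correctness": 4.0, "speed": 3.0, "clarity": 3.0}

def task_score(correctness: float, speed: float, clarity: float) -> float:
    """Sum the rubric components, clamping each to its maximum (assumed)."""
    parts = {"correctness": correctness, "speed": speed, "clarity": clarity}
    return sum(min(v, RUBRIC_MAX[k]) for k, v in parts.items())

print(task_score(4.0, 3.0, 3.0))  # a perfect run scores 10.0
```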
Past Evaluations
| Date | Model | Score |
|---|---|---|
| Mar 13 | GPT-5 | 9.60 |
| Mar 12 | GPT-5.4 | 9.60 |
| Mar 11 | Gemini 2.5 Pro | 9.40 |
| Mar 10 | GPT-5.4 | 9.42 |
| Mar 9 | Claude Opus 4.6 | 9.68 |
| Mar 1 | Claude Opus 4.6 | 9.68 |
| Feb 16 | Claude Opus 4.6 (Adaptive) | 9.59 |
| Feb 15 | Claude Opus 4.6 | 9.43 |
| Feb 14 | Claude Opus 4.6 | 9.50 |
| Feb 13 | GPT-5 | 9.18 |
| Feb 12 | Claude Opus 4.6 | 9.41 |