AI Model Benchmarks

Daily head-to-head evaluations of frontier AI models across coding, reasoning, and tool-use tasks.

Category weights: Coding 40% · Reasoning 35% · Tool-use 25%
Latest

Daily Model Eval Scorecard — 2026-03-22

Model            Coding   Reasoning   Tool-use   Total
MiniMax 2.7       8.7        9.0         8.5      8.73

Per-task breakdown (rubric scores per the methodology below: Correctness, Speed, Clarity; each Total is their sum out of 10):

Task        Model              Correctness   Speed   Clarity   Total
Coding      Claude Opus 4.6        4.0        2.6      2.7      9.30
Coding      GPT-5.4 XHigh          3.8        2.4      2.8      9.00
Coding      MiniMax 2.7            3.7        2.6      2.4      8.70
Coding      GLM-5 Turbo            3.6        2.5      2.5      8.60
Reasoning   GPT-5.4 XHigh          3.9        2.5      3.1      9.50
Reasoning   MiniMax 2.7            3.6        2.8      2.6      9.00
Reasoning   Claude Opus 4.6        3.7        2.5      2.9      9.10
Reasoning   GLM-5 Turbo            3.5        2.5      2.8      8.80
Tool-use    GLM-5 Turbo            3.8        2.9      2.7      9.40
Tool-use    Claude Opus 4.6        3.7        2.6      2.6      8.90
Tool-use    GPT-5.4 XHigh          3.7        2.3      2.7      8.70
Tool-use    MiniMax 2.7            3.4        2.6      2.5      8.50

Head-to-head results across coding, reasoning, and tool-use tasks. Today: Claude Opus 4.6, GPT-5.4 XHigh, GLM-5 Turbo, and MiniMax 2.7.

Methodology

Correctness (4 pts)

Does the model actually solve the problem correctly?

Speed (3 pts)

How fast does it get there? Time to first token plus total latency.

Clarity (3 pts)

Is the output clean, well-structured, and free of hallucination?
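
To make the scoring arithmetic concrete, here is a minimal sketch of how one model's scorecard could be assembled from these rubric scores. The `TaskResult` dataclass and `category_summary` helper are illustrative names, not part of any published harness; the grounded assumptions are the 4/3/3 rubric above, that a task's total is the sum of its three rubric scores, and that the overall figure is the plain average of the three category totals (which reproduces MiniMax 2.7's 8.73 from today's card).

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    """One model's rubric scores on one task category (hypothetical structure)."""
    model: str
    task: str           # "coding", "reasoning", or "tool-use"
    correctness: float  # 0-4 pts, per the methodology
    speed: float        # 0-3 pts
    clarity: float      # 0-3 pts

    @property
    def total(self) -> float:
        # A task total is the plain sum of the three rubric scores (out of 10).
        return round(self.correctness + self.speed + self.clarity, 2)

def category_summary(results: list[TaskResult], model: str) -> dict[str, float]:
    """Per-category totals for one model, plus their unweighted average."""
    by_task = {r.task: r.total for r in results if r.model == model}
    by_task["average"] = round(mean(by_task.values()), 2)
    return by_task

# MiniMax 2.7's three rows from the 2026-03-22 scorecard above.
rows = [
    TaskResult("MiniMax 2.7", "coding",    3.7, 2.6, 2.4),
    TaskResult("MiniMax 2.7", "reasoning", 3.6, 2.8, 2.6),
    TaskResult("MiniMax 2.7", "tool-use",  3.4, 2.6, 2.5),
]

print(category_summary(rows, "MiniMax 2.7"))
# -> {'coding': 8.7, 'reasoning': 9.0, 'tool-use': 8.5, 'average': 8.73}
```

Swapping the plain mean for the 40%/35%/25% category weights listed at the top of the page would be a one-line change (a weighted sum over `by_task` instead of `mean`), if that is how the overall score is meant to be combined.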

Past Evaluations

Date     Model                         Score
Mar 13   GPT-5                          9.60
Mar 12   GPT-5.4                        9.60
Mar 11   Gemini 2.5 Pro                 9.40
Mar 10   GPT-5.4                        9.42
Mar  9   Claude Opus 4.6                9.68
Mar  1   Claude Opus 4.6                9.68
Feb 16   Claude Opus 4.6 (Adaptive)     9.59
Feb 15   Claude Opus 4.6                9.43
Feb 14   Claude Opus 4.6                9.50
Feb 13   GPT-5                          9.18
Feb 12   Claude Opus 4.6                9.41