The definitive ranking of AI models for mathematics, logical reasoning, and problem-solving, based on the AIME 2025, GPQA Diamond, and Humanity's Last Exam benchmarks from independent evaluations.
| Rank | Model | Organization | Quality Index | AIME 2025 | GPQA Diamond | HLE | License |
|---|---|---|---|---|---|---|---|
| 1 | Doubao-Seed-1.8 | ByteDance Seed | 61 | 85% | 80% | 15% | Proprietary |
| 2 | GPT-5.2 (xhigh) | OpenAI | 50.5 | 99% | 90% | 31% | Proprietary |
| 3 | Claude Opus 4.5 (high) | Anthropic | 49.1 | 91% | 87% | 28% | Proprietary |
| 4 | Gemini 3 Pro Preview (high) | Google | 47.9 | 96% | 91% | 37% | Proprietary |
| 5 | GPT-5.1 (high) | OpenAI | 47 | 94% | 87% | 27% | Proprietary |
| 6 | Gemini 3 Flash | Google | 45.9 | 97% | 90% | 35% | Proprietary |
| 7 | Claude 4.5 Sonnet | Anthropic | 42.4 | 88% | 83% | 17% | Proprietary |
| 8 | GLM-4.7 (Thinking) | Z AI | 41.7 | 95% | 86% | 25% | Open |
- **AIME 2025**: The American Invitational Mathematics Examination; tests Olympiad-level problem solving. Top score: 99% (GPT-5.2 (xhigh)).
- **GPQA Diamond**: Graduate-level science questions written by domain experts. Top score: 91% (Gemini 3 Pro Preview (high)).
- **Humanity's Last Exam (HLE)**: Cutting-edge questions designed to challenge frontier AI systems. Top score: 37% (Gemini 3 Pro Preview (high)).
What the benchmarks show: while the top models now solve 95%+ of AIME problems, Humanity's Last Exam scores remain below 40%, leaving significant room for improvement on novel, out-of-distribution reasoning. The best models for math combine high AIME scores with strong GPQA Diamond performance, demonstrating both competition math skills and deep graduate-level understanding.
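To make that "combine high AIME with strong GPQA" point concrete, here is a minimal Python sketch that ranks the eight models in the table above by an unweighted average of their three benchmark scores. The averaging scheme is our illustrative assumption, not the methodology behind the Quality Index.

```python
# Scores from the table above: (AIME 2025, GPQA Diamond, HLE), as percentages.
SCORES = {
    "Doubao-Seed-1.8":             (85, 80, 15),
    "GPT-5.2 (xhigh)":             (99, 90, 31),
    "Claude Opus 4.5 (high)":      (91, 87, 28),
    "Gemini 3 Pro Preview (high)": (96, 91, 37),
    "GPT-5.1 (high)":              (94, 87, 27),
    "Gemini 3 Flash":              (97, 90, 35),
    "Claude 4.5 Sonnet":           (88, 83, 17),
    "GLM-4.7 (Thinking)":          (95, 86, 25),
}

def mean(xs):
    return sum(xs) / len(xs)

# Rank by the unweighted mean of the three benchmarks (an illustrative
# aggregate, NOT the Quality Index shown in the table).
for model, scores in sorted(SCORES.items(), key=lambda kv: mean(kv[1]), reverse=True):
    aime, gpqa, hle = scores
    print(f"{model:<30} mean={mean(scores):5.1f}  AIME={aime}%  GPQA={gpqa}%  HLE={hle}%")
```

Note that this naive average orders the models quite differently from the Quality Index (Gemini 3 Pro Preview comes out on top), which suggests the published index weighs factors beyond these three benchmarks.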
Use our interactive comparison tool to explore reasoning benchmarks, pricing, and latency for all 8 math models.
As of January 2026, Doubao-Seed-1.8 tops our Quality Index for math, scoring 85% on AIME 2025 and 80% on GPQA Diamond. For raw competition math performance, GPT-5.2 (xhigh) leads with a near-perfect 99% on AIME 2025.
Yes, modern AI models excel at calculus, linear algebra, differential equations, and even competition-level number theory. The top models score around 90% on graduate-level GPQA Diamond questions covering physics, chemistry, and biology with mathematical components. However, novel research-level problems (like those in Humanity's Last Exam) remain challenging.
For free or open-source math assistance, GLM-4.7 (Thinking) and DeepSeek V3.2 offer outstanding performance. GLM-4.7 scores 95% on AIME 2025 and can be self-hosted, while DeepSeek V3.2 offers competitive API pricing at a fraction of proprietary model costs.
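For readers who want to try the self-hosted route, below is a minimal sketch of querying a locally served open model through an OpenAI-compatible endpoint (the pattern used by servers such as vLLM). The base URL and model identifier are placeholders for your own deployment, not official values.

```python
# Minimal sketch: query a self-hosted open model through an
# OpenAI-compatible endpoint (e.g. one served by vLLM).
# The base_url and model name below are deployment-specific placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your self-hosted server
    api_key="not-needed-for-local",       # many local servers ignore the key
)

response = client.chat.completions.create(
    model="glm-4.7-thinking",  # placeholder: use the id your server reports
    messages=[
        {"role": "user",
         "content": "Find all positive integers n such that n^2 + 1 divides n^3 + 9."},
    ],
    temperature=0.0,  # keep outputs stable for math problems
)

print(response.choices[0].message.content)
```

The same client code works against hosted APIs that expose an OpenAI-compatible interface, so you can compare a self-hosted GLM-4.7 against a paid endpoint by changing only `base_url` and `model`.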