The definitive ranking of AI models for mathematics, logical reasoning, and problem-solving. Rankings are based on AIME 2025, GPQA Diamond, and Humanity's Last Exam benchmark results from independent evaluations.
| Rank | Model | Developer | Quality | AIME 2025 | GPQA Diamond | HLE | License |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 (xhigh) | OpenAI | 50.5 | 99% | 90% | 31% | Proprietary |
| 2 | GLM-5 (Reasoning) | Z AI | 49.64 | - | - | - | Open |
| 3 | Claude Opus 4.5 (high) | Anthropic | 49.1 | 91% | 87% | 28% | Proprietary |
| 4 | Gemini 3 Pro Preview (high) | Google | 47.9 | 96% | 91% | 37% | Proprietary |
| 5 | GPT-5.1 (high) | OpenAI | 47 | 94% | 87% | 27% | Proprietary |
| 6 | Kimi K2.5 (Reasoning) | Kimi | 46.73 | 96% | - | - | Open |
| 7 | Gemini 3 Flash | Google | 45.9 | 97% | 90% | 35% | Proprietary |
| 8 | Gemini 3 Flash Preview (Reasoning) | Google | 45.9 | - | - | - | Proprietary |
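If you want to reproduce a ranking like this from the raw benchmark numbers, the sketch below averages the three benchmark scores into a single figure. The equal weighting and the `composite` helper are assumptions for illustration only; the Quality column in the table comes from an independent index whose methodology is not described here, so the two will not match exactly.

```python
# Illustrative sketch only: rank models by an equal-weight average of the three
# benchmark scores from the table above. The weighting scheme is an assumption,
# not the methodology behind the table's Quality column.
from statistics import mean

# Benchmark scores as fractions (models with missing scores are omitted).
MODELS = {
    "GPT-5.2 (xhigh)":             {"aime_2025": 0.99, "gpqa_diamond": 0.90, "hle": 0.31},
    "Claude Opus 4.5 (high)":      {"aime_2025": 0.91, "gpqa_diamond": 0.87, "hle": 0.28},
    "Gemini 3 Pro Preview (high)": {"aime_2025": 0.96, "gpqa_diamond": 0.91, "hle": 0.37},
    "GPT-5.1 (high)":              {"aime_2025": 0.94, "gpqa_diamond": 0.87, "hle": 0.27},
    "Gemini 3 Flash":              {"aime_2025": 0.97, "gpqa_diamond": 0.90, "hle": 0.35},
}

def composite(scores: dict[str, float]) -> float:
    """Equal-weight average across the benchmarks (a simplifying assumption)."""
    return mean(scores.values())

# Print models from strongest to weakest under this composite.
for name, scores in sorted(MODELS.items(), key=lambda kv: composite(kv[1]), reverse=True):
    print(f"{name:30s} composite = {composite(scores):.3f}")
```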
- **AIME 2025**: The American Invitational Mathematics Examination tests Olympiad-level problem solving.
- **GPQA Diamond**: Graduate-level science questions written by domain experts.
- **Humanity's Last Exam (HLE)**: Cutting-edge questions designed to challenge frontier AI systems.
What the benchmarks show: While top models now solve 95%+ of AIME problems, Humanity's Last Exam scores remain below 40%, leaving significant room for improvement in novel, out-of-distribution reasoning. The best models for math pair high AIME scores with strong GPQA Diamond performance, demonstrating both competition math skill and deep graduate-level understanding.
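To make that gap concrete, the short sketch below (same illustrative setup as the snippet after the table) computes the spread between each model's AIME 2025 and HLE scores using the figures reported above; the drop of roughly 60 points is what "room for improvement in out-of-distribution reasoning" looks like in numbers.

```python
# Illustrative sketch: the drop from competition-math accuracy (AIME 2025) to
# novel-question accuracy (HLE), using the scores reported in the table above.
SCORES = {
    "GPT-5.2 (xhigh)":             (0.99, 0.31),
    "Claude Opus 4.5 (high)":      (0.91, 0.28),
    "Gemini 3 Pro Preview (high)": (0.96, 0.37),
    "GPT-5.1 (high)":              (0.94, 0.27),
    "Gemini 3 Flash":              (0.97, 0.35),
}

for name, (aime, hle) in SCORES.items():
    print(f"{name:30s} AIME {aime:.0%} -> HLE {hle:.0%}  (drop: {aime - hle:.0%})")
```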
Use our interactive comparison tool to explore reasoning benchmarks, pricing, and latency for all 8 math models.
As of January 2026, GPT-5.2 (xhigh) leads our math benchmarks with near-perfect performance on AIME 2025 (99%) and a 90% score on GPQA Diamond, making it the top choice for complex competition math.
Yes, modern AI models excel at calculus, linear algebra, differential equations, and even competition-level number theory. The top models score above 90% on graduate-level GPQA Diamond questions covering physics, chemistry, and biology with mathematical components. However, novel research-level problems (like those in Humanity's Last Exam) remain challenging.
For free/open-source math assistance, GLM-4.7 Thinking and DeepSeek V3.2 offer outstanding performance. GLM-4.7 achieves 95% on AIME 2025 and can be self-hosted, while DeepSeek V3.2 offers competitive API pricing at a fraction of proprietary model costs.