The definitive ranking of AI models for software development, code generation, and programming tasks, based on independent evaluations across the LiveCodeBench, Terminal-Bench, and SciCode benchmarks.
| Rank | Model | Creator | Quality (Intelligence Index) | LiveCodeBench | Terminal-Bench | SciCode | License |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 (xhigh) | OpenAI | 50.5 | 89% | 44% | 52% | Proprietary |
| 2 | GLM-5 (Reasoning) | Z AI | 49.64 | - | - | - | Open |
| 3 | Claude Opus 4.5 (high) | Anthropic | 49.1 | 87% | 44% | 50% | Proprietary |
| 4 | Gemini 3 Pro Preview (high) | Google | 47.9 | 92% | 39% | 56% | Proprietary |
| 5 | GPT-5.1 (high) | OpenAI | 47.0 | 87% | 43% | 43% | Proprietary |
| 6 | Kimi K2.5 (Reasoning) | Kimi | 46.73 | 85% | - | - | Open |
| 7 | Gemini 3 Flash | Google | 45.9 | 91% | 36% | 51% | Proprietary |
| 8 | Gemini 3 Flash Preview (Reasoning) | Google | 45.9 | - | - | - | Proprietary |
Our coding model rankings are based on three key benchmarks that evaluate real-world programming capabilities:

- **LiveCodeBench**: Evaluates code generation across multiple programming languages with fresh, contamination-free problems.
- **Terminal-Bench**: Tests complex terminal operations, DevOps tasks, and system-level programming capabilities.
- **SciCode**: Measures scientific computing and research-oriented programming across multiple domains.
Use our interactive comparison tool to explore pricing, latency, and benchmark scores for all 8 coding models.
As of January 2026, GPT-5.2 (xhigh) leads our coding benchmarks with an 89% score on LiveCodeBench. For open source alternatives, GLM-4.7 Thinking and DeepSeek V3.2 offer comparable performance at a fraction of the cost.
For professional software development, we recommend Claude Opus 4.5 for its excellent code review and debugging capabilities, or GPT-5.2 (xhigh) for complex architectural decisions. Both score above 85% on LiveCodeBench and excel at multi-file code understanding.
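As a minimal sketch of wiring one of these models into a review workflow, the snippet below sends a diff to a chat endpoint via the OpenAI Python SDK. The model identifier and the diff are placeholder assumptions for illustration; substitute whichever model and provider you actually use.

```python
# Minimal code-review request via the OpenAI Python SDK (pip install openai).
# The model name is an assumed identifier for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

diff = """\
--- a/utils.py
+++ b/utils.py
@@ -1 +1 @@
-def mean(xs): return sum(xs) / len(xs)
+def mean(xs): return sum(xs) / max(len(xs), 1)
"""

response = client.chat.completions.create(
    model="gpt-5.2",  # hypothetical identifier; use the model you rank highest
    messages=[
        {"role": "system", "content": "You are a strict senior code reviewer."},
        {"role": "user", "content": f"Review this diff and flag bugs or style issues:\n{diff}"},
    ],
)
print(response.choices[0].message.content)
```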
GLM-4.7 Thinking achieves 89% on LiveCodeBench while being free to self-host under the MIT license. DeepSeek V3.2 is another excellent choice at $0.35 per million tokens, making it the best value for high-volume coding workloads.
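To put the $0.35-per-million-token price in concrete terms, here is a back-of-the-envelope estimate; the monthly token volume is an assumed figure, not a measurement.

```python
# Rough monthly cost at the quoted $0.35 per million tokens.
# The 500M tokens/month workload is an assumption for illustration.
PRICE_PER_MILLION_TOKENS = 0.35  # USD, DeepSeek V3.2 as quoted above
monthly_tokens = 500_000_000     # hypothetical high-volume coding workload

monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"Estimated monthly spend: ${monthly_cost:,.2f}")  # -> $175.00
```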
For Ollama users, we recommend DeepSeek Coder V2 (16B parameters), Qwen2.5-Coder (7B or 14B), and CodeLlama 34B. These models run efficiently on consumer hardware while delivering strong coding performance on local machines.
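If you want to try one of these locally, the sketch below queries a model through the official `ollama` Python client against a running Ollama daemon. The model tag is our assumed example; confirm the exact tag in the Ollama model library before pulling it.

```python
# Query a locally served coding model through the Ollama Python client
# (pip install ollama; requires the Ollama daemon and a pulled model).
# The model tag is an assumed example; check the Ollama library for exact tags.
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",  # assumed tag; other local coding models work the same way
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)
print(response["message"]["content"])
```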
LiveCodeBench tests models on fresh, contamination-free programming problems across multiple languages. Scores above 85% indicate excellent code generation; above 70% is production-ready for most tasks. See our methodology page for details.
GPT-5.2 leads slightly on raw benchmark scores (89% vs 87% on LiveCodeBench), but Claude Opus 4.5 excels at code explanation, debugging, and architectural reasoning. For pure code generation, GPT-5.2; for code review and understanding, Claude Opus 4.5.
Data sources: Rankings based on the Artificial Analysis Intelligence Index. Explore all models in our interactive explorer or compare models side-by-side.