The definitive ranking of AI models for software development, code generation, and programming tasks, based on LiveCodeBench, Terminal-Bench, and SciCode benchmarks from independent evaluations.
Historical snapshot
This page is a dated monthly snapshot. For the live version with current rankings, see Best LLM for Coding or Best AI Models.
| Rank | Model | Quality | LiveCodeBench | Terminal-Bench | SciCode | License |
|---|---|---|---|---|---|---|
| 1 | GLM-4.7 (Thinking) Z AI | 68 | 89% | 30% | 45% | Open |
| 2 | Gemini 3 Flash | 55 | 80% | 30% | 50% | Proprietary |
| 3 | DeepSeek V3.2 DeepSeek | 52 | 59% | 31% | 39% | Open |
| 4 | GPT-5.2 (xhigh) OpenAI | 50.5 | 89% | 44% | 52% | Proprietary |
| 5 | Claude Opus 4.5 (Reasoning) Anthropic | 49.69 | - | - | - | Proprietary |
| 6 | GLM-5 (Reasoning) Z AI | 49.64 | - | - | - | Open |
| 7 | Gemini 3 Pro Preview (high) | 47.9 | 92% | 39% | 56% | Proprietary |
| 8 | GPT-5.1 (high) OpenAI | 47 | 87% | 43% | 43% | Proprietary |
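
The table can also be read one benchmark at a time. The following is a minimal sketch that copies the table's figures into a list and finds the top model per column. The names `ROWS`, `COLUMNS`, and `leader` are illustrative, and rows shown with "-" are treated as having no score rather than a score of zero.

```python
# (model, quality, livecodebench, terminal_bench, scicode, license)
# Values are copied from the table above; None marks a "-" cell.
ROWS = [
    ("GLM-4.7 (Thinking)", 68, 89, 30, 45, "Open"),
    ("Gemini 3 Flash", 55, 80, 30, 50, "Proprietary"),
    ("DeepSeek V3.2", 52, 59, 31, 39, "Open"),
    ("GPT-5.2 (xhigh)", 50.5, 89, 44, 52, "Proprietary"),
    ("Claude Opus 4.5 (Reasoning)", 49.69, None, None, None, "Proprietary"),
    ("GLM-5 (Reasoning)", 49.64, None, None, None, "Open"),
    ("Gemini 3 Pro Preview (high)", 47.9, 92, 39, 56, "Proprietary"),
    ("GPT-5.1 (high)", 47, 87, 43, 43, "Proprietary"),
]

# Position of each score within a row tuple.
COLUMNS = {"quality": 1, "livecodebench": 2, "terminal_bench": 3, "scicode": 4}


def leader(column: str) -> str:
    """Return the model with the highest score in `column`, skipping missing cells."""
    idx = COLUMNS[column]
    scored = [row for row in ROWS if row[idx] is not None]
    return max(scored, key=lambda row: row[idx])[0]


for column in COLUMNS:
    print(f"{column}: {leader(column)}")
```

On these figures the Quality leader is not the leader on any single benchmark, so the overall rank and the per-benchmark scores are best read together.
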
Our coding model rankings are based on three key benchmarks that evaluate real-world programming capabilities:
LiveCodeBench: Evaluates code generation across multiple programming languages with fresh, contamination-free problems.
Terminal-Bench: Tests complex terminal operations, DevOps tasks, and system-level programming capabilities.
SciCode: Measures scientific computing and research-oriented programming across multiple domains.
Use our interactive comparison tool to explore pricing, latency, and benchmark scores for all 8 coding models.
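As of January 2026, GLM-4.7 (Thinking) leads our coding ranking with a Quality Index of 68 and scores 89% on LiveCodeBench (the highest single LiveCodeBench score in the table is Gemini 3 Pro Preview's 92%). Among open models, GLM-4.7 Thinking and DeepSeek V3.2 can be run at a fraction of the cost of proprietary alternatives, although DeepSeek V3.2 trails on LiveCodeBench at 59%.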
For professional software development, we recommend Claude Opus 4.5 for its excellent code review and debugging capabilities, or GPT-5.2 (xhigh) for complex architectural decisions. GPT-5.2 (xhigh) scores 89% on LiveCodeBench; the table above lists no LiveCodeBench result for Claude Opus 4.5. Both excel at multi-file code understanding.
GLM-4.7 Thinking achieves 89% on LiveCodeBench while being free to self-host under the MIT license. DeepSeek V3.2 is another excellent choice at $0.35 per million tokens, making it the best value for high-volume coding workloads.
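
To make the per-token price concrete, here is a rough cost sketch. It assumes the $0.35 figure is a single blended price across input and output tokens, which the text above does not specify. The monthly token volume is a hypothetical workload, not a measured one.

```python
PRICE_PER_MILLION_USD = 0.35  # DeepSeek V3.2 price quoted above (assumed blended)


def estimated_cost_usd(total_tokens: int,
                       price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Estimate spend in USD for a given number of tokens."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1_000_000 * price_per_million


monthly_tokens = 250_000_000  # hypothetical high-volume workload
print(f"Estimated monthly cost: ${estimated_cost_usd(monthly_tokens):.2f}")
```

At that hypothetical volume the arithmetic is 250 × $0.35, or about $87.50 per month. Self-hosting an open model replaces this per-token charge with hardware and operations costs, which this sketch does not model.
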
For Ollama users, we recommend DeepSeek Coder V2 (16B parameters), Qwen2.5-Coder (7B or 14B), and CodeLlama 34B. These models run efficiently on consumer hardware while delivering strong coding performance on local machines.
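
As a rough illustration of calling a local model, the sketch below builds a request for Ollama's HTTP generate endpoint using only the Python standard library. It assumes an Ollama server is running on its default port and that a model has already been pulled. The tag `qwen2.5-coder:7b` and the helper names `build_request` and `generate` are illustrative assumptions, not part of the rankings above.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call, assuming the model tag below has been pulled locally:
# print(generate("qwen2.5-coder:7b", "Write a Python function that reverses a string."))
```

The example call is left commented out because it only works with a server and model available on the machine.
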
LiveCodeBench tests models on fresh, contamination-free programming problems across multiple languages. Scores above 85% indicate excellent code generation; above 70% is production-ready for most tasks. See our methodology page for details.
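
The thresholds above can be written as a small helper. This is a sketch only: the function name `livecodebench_tier` and the label for scores at or below 70% are assumptions, since the text names only the two upper bands.

```python
def livecodebench_tier(score_percent: float) -> str:
    """Map a LiveCodeBench score (0-100) to the bands described above."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score_percent must be between 0 and 100")
    if score_percent > 85:
        return "excellent"
    if score_percent > 70:
        return "production-ready"
    return "below production-ready threshold"  # label assumed, not from the text


# Scores taken from the rankings table.
for model, score in [("GLM-4.7 (Thinking)", 89), ("Gemini 3 Flash", 80), ("DeepSeek V3.2", 59)]:
    print(f"{model}: {livecodebench_tier(score)}")
```

Both comparisons are strict, matching the wording "above 85%" and "above 70%", so a score of exactly 85 falls in the production-ready band.
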
GPT-5.2 leads slightly on raw benchmark scores (89% vs a reported 87% on LiveCodeBench; the table above lists no LiveCodeBench result for Claude Opus 4.5), but Claude Opus 4.5 excels at code explanation, debugging, and architectural reasoning. Choose GPT-5.2 for pure code generation and Claude Opus 4.5 for code review and understanding.
Data sources: Rankings based on the Artificial Analysis Intelligence Index. Explore all models in our interactive explorer or compare models side-by-side.