The definitive ranking of AI models for software development, code generation, and programming. Ranked by LiveCodeBench, Terminal-Bench, and SciCode: independent, contamination-free evaluations. Updated weekly.
| Rank | Model | Developer | Quality | LiveCodeBench | Terminal-Bench | SciCode | License |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 (xhigh) | OpenAI | 50.5 | 89% | 44% | 52% | Proprietary |
| 2 | GLM-5 (Reasoning) | Z AI | 49.6 | - | - | - | Open |
| 3 | Claude Opus 4.5 (high) | Anthropic | 49.1 | 87% | 44% | 50% | Proprietary |
| 4 | Gemini 3 Pro Preview (high) | Google | 47.9 | 92% | 39% | 56% | Proprietary |
| 5 | GPT-5.1 (high) | OpenAI | 47.0 | 87% | 43% | 43% | Proprietary |
| 6 | Kimi K2.5 (Reasoning) | Kimi | 46.7 | 85% | - | - | Open |
| 7 | Gemini 3 Flash | Google | 45.9 | 91% | 36% | 51% | Proprietary |
| 8 | Gemini 3 Flash Preview (Reasoning) | Google | 45.9 | - | - | - | Proprietary |
| 9 | Claude 4.5 Sonnet | Anthropic | 42.4 | 71% | 33% | 45% | Proprietary |
| 10 | MiniMax-M2.5 | MiniMax | 42.0 | - | - | - | Open |
Rankings combine three independent benchmarks that test real programming capabilities, not just pattern matching:
- **LiveCodeBench**: Contamination-free code generation problems updated monthly across Python, JavaScript, C++, and more. The gold standard for coding benchmarks.
- **Terminal-Bench**: Tests complex terminal operations, shell scripting, DevOps tasks, and system-level programming, all critical for real-world engineering.
- **SciCode**: Scientific computing and research programming. Tests the ability to implement algorithms from papers and numerical methods correctly.
Quality Index from Artificial Analysis. Benchmark data updated weekly from public leaderboards.
See exact pricing, latency, and benchmark scores for all 10 coding models in our interactive comparison tool.
As of 2026, GPT-5.2 (xhigh) leads our coding benchmarks. For open-source alternatives, GLM-5 (Reasoning) and DeepSeek V3.2 offer comparable performance at a fraction of the cost, making them excellent choices for both API use and self-hosting.
For professional software development, Claude Opus 4.5 excels at code review, debugging, and architectural reasoning. GPT-5.2 leads on raw code generation benchmarks. Both score above 85% on LiveCodeBench and handle multi-file codebases well.
GLM-5 (Reasoning) is the top open-weight coding model in 2026 and is free to self-host. DeepSeek V3.2 is the best value via API at $0.35/M tokens, matching proprietary models at a tenth of the cost. Both are available on Ollama and Hugging Face.
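Most open-weight providers expose an OpenAI-compatible endpoint, so trying one is often just a base-URL change. Here is a minimal sketch assuming the `openai` Python client, DeepSeek's published base URL, and a `DEEPSEEK_API_KEY` environment variable; the `deepseek-chat` model name is illustrative, so check the provider's docs for current model IDs.

```python
# Minimal sketch: calling an OpenAI-compatible endpoint (here, DeepSeek).
# Assumes the `openai` client library (>=1.0) and a DEEPSEEK_API_KEY env var.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",  # illustrative; check the provider's model list
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
)
print(response.choices[0].message.content)
```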
For Ollama, the best coding models are: DeepSeek Coder V2 (16B, excellent for Python/JS), Qwen2.5-Coder 32B (strong on competitive programming), and Llama 3.3 70B (best general-purpose model you can run locally). All run on a 24GB+ GPU.
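Once a model is pulled, you can query it from Python through Ollama's local REST API (port 11434 by default). A minimal sketch, assuming `ollama serve` is running and the model has already been pulled; the `qwen2.5-coder:32b` tag is illustrative.

```python
# Minimal sketch: querying a locally hosted Ollama model via its REST API.
# Assumes `ollama serve` is running on the default port and the model has
# been pulled beforehand (e.g. `ollama pull qwen2.5-coder:32b`).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:32b",  # illustrative tag; use any pulled model
        "prompt": "Write a bash one-liner that counts lines of Python code.",
        "stream": False,               # return one JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```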
GPT-5.2 slightly leads on raw LiveCodeBench scores, but Claude Opus 4.5 is generally preferred for real-world coding tasks: it is better at explaining code, catching subtle bugs, and handling long multi-file refactors. For pure code generation speed, choose GPT-5.2; for code review and engineering quality, Claude Opus 4.5.
This ranking is updated weekly as new benchmark results and model releases become available. Quality Index scores are pulled live from Artificial Analysis. When major new models are released (GPT-5 series, Claude 4 series, etc.), rankings are updated within 24โ48 hours.
Data sources: Rankings based on the Artificial Analysis Intelligence Index. Explore all models in our interactive leaderboard or compare models side by side.