The definitive ranking of open-weight AI models you can self-host, fine-tune, and deploy under open licenses. Ranked by real benchmarks: MMLU-Pro, AIME 2025, and LiveCodeBench. Includes Ollama setup recommendations. Updated weekly.
| Rank | Model | Maker | Quality Index | MMLU-Pro | AIME 2025 | LiveCodeBench | Context |
|---|---|---|---|---|---|---|---|
| 1 | GLM-5 (Reasoning) | Z AI | 49.64 | - | - | - | 203K |
| 2 | Kimi K2.5 (Reasoning) | Kimi | 46.73 | - | 96% | 85% | 256K |
| 3 | MiniMax-M2.5 | MiniMax | 41.97 | - | - | - | 205K |
| 4 | GLM-4.7 (Thinking) | Z AI | 41.7 | 86% | 95% | 89% | 200K |
| 5 | DeepSeek V3.2 | DeepSeek | 41.2 | 86% | 92% | 86% | 128K |
| 6 | Kimi K2 Thinking | Kimi | 40.3 | 85% | 95% | 85% | 256K |
| 7 | MiniMax-M2.1 | MiniMax | 39.3 | 88% | 83% | 81% | 205K |
| 8 | MiMo-V2-Flash | Xiaomi | 39 | 84% | 96% | 87% | 256K |
| 9 | Llama Nemotron Ultra | NVIDIA | 38 | 83% | 64% | 64% | 128K |
| 10 | MiniMax-M2 | MiniMax | 35.7 | 82% | 78% | 83% | 205K |
| 11 | DeepSeek V3.2 Speciale | DeepSeek | 34.1 | 86% | 97% | 90% | 128K |
| 12 | DeepSeek V3.1 Terminus | DeepSeek | 33.4 | 85% | 90% | 80% | 128K |
Ranked by LiveCodeBench score, the gold standard for coding benchmarks.
| Model | Maker | LiveCodeBench | Quality Index | Best For |
|---|---|---|---|---|
| DeepSeek V3.2 Speciale | DeepSeek | 90% | 34.1 | Top open source pick overall |
| GLM-4.7 (Thinking) | Z AI | 89% | 41.7 | Best for API use (cheap) |
| MiMo-V2-Flash | Xiaomi | 87% | 39 | Best for Ollama / local |
| DeepSeek V3.2 | DeepSeek | 86% | 41.2 | Best for reasoning-heavy tasks |
| Kimi K2.5 (Reasoning) | Kimi | 85% | 46.73 | Strong alternative |
Ollama makes it easy to run open-weight models locally. Here are the top picks by hardware tier:

- **RTX 3070 · M2 MacBook Air** (~8GB VRAM): Gemma 3 4B, Qwen2.5 7B
- **RTX 3090/4090 · M2 Pro/Max** (~24GB VRAM): Qwen2.5-Coder 32B, DeepSeek R1 distilled variants
- **2× RTX 4090 · Mac Studio M2 Ultra** (~48GB VRAM): Llama 3.3 70B at Q4
Tip: Use 4-bit quantization (Q4_K_M) to cut VRAM requirements to roughly a quarter of FP16, with minimal quality loss. For example, Llama 3.3 70B at Q4_K_M runs in ~40GB, versus ~140GB at FP16.
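Once a model is pulled, Ollama exposes a local REST API on port 11434, so you can script against it. A minimal sketch, assuming the Ollama server is running and a Q4_K_M build has been pulled; the exact model tag here is an assumption, so check the Ollama library for the quantization tags actually available:

```python
# Minimal sketch: query a local Ollama server over its REST API.
# Assumes Ollama is running on the default port (11434) and that a
# Q4_K_M build of the model has already been pulled. The tag below
# is an assumption; check the Ollama library for the exact tags.
import json
import urllib.request

def generate(model: str, prompt: str) -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the full completion in "response"
        return json.loads(resp.read())["response"]

print(generate("llama3.3:70b-instruct-q4_K_M", "Explain Q4_K_M quantization in one sentence."))
```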
Only models with openly available weights (Apache 2.0, MIT, Llama community license, or similar open licenses) are included. Rankings use the Artificial Analysis Quality Index as the primary metric, combined with:
- **MMLU-Pro:** Comprehensive knowledge benchmark across 14 domains. Tests breadth of model capability.
- **AIME 2025:** Competition math that tests advanced reasoning. Best signal for math and science tasks.
- **LiveCodeBench:** Contamination-free code generation. Best signal for software development capability.
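This page doesn't publish the exact weighting behind the composite index, so purely as an illustration, here is a sketch of how per-benchmark scores can be blended into a single quality score. The weights are hypothetical; this is not the Artificial Analysis formula:

```python
# Illustrative only: blend benchmark scores into one composite index.
# The weights below are hypothetical assumptions, NOT the Artificial
# Analysis methodology, which this page does not publish.
WEIGHTS = {"mmlu_pro": 0.4, "aime_2025": 0.3, "livecodebench": 0.3}

def composite_index(scores: dict[str, float]) -> float:
    """Weighted average over whichever benchmarks a model has scores for."""
    available = {k: w for k, w in WEIGHTS.items() if scores.get(k) is not None}
    total = sum(available.values())
    return sum(scores[k] * w for k, w in available.items()) / total

# GLM-4.7 (Thinking) from the table above: 86% / 95% / 89%
print(composite_index({"mmlu_pro": 86, "aime_2025": 95, "livecodebench": 89}))  # ≈ 89.6
```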
See live pricing from self-hosting providers, plus latency and full benchmark scores, for all open source models.
**What is the best open source AI model in 2026?**
GLM-5 (Reasoning) leads the open source rankings in 2026 with a Quality Index of 49.64. For API use, DeepSeek V3.2 is the best value at $0.35/M tokens. For Ollama/local use, Qwen2.5-Coder 32B and Llama 3.3 70B have the best community support.
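To put that $0.35/M figure in perspective, here's a back-of-the-envelope estimate. This is a minimal sketch assuming a single blended per-token rate; real providers typically price input and output tokens separately:

```python
# Back-of-the-envelope API cost at a blended $0.35 per million tokens.
# Assumption: one blended rate; real APIs usually charge different
# rates for input vs. output tokens.
PRICE_PER_M_TOKENS = 0.35  # USD, DeepSeek V3.2 per the text above

def monthly_cost(requests_per_day: int, tokens_per_request: int) -> float:
    tokens = requests_per_day * tokens_per_request * 30  # ~30 days/month
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

# e.g. 1,000 requests/day at ~2,000 tokens each -> 60M tokens/month
print(f"${monthly_cost(1_000, 2_000):.2f}/month")  # -> $21.00/month
```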
**What are the best models to run with Ollama?**
For Ollama in 2026: Qwen2.5-Coder 32B for coding (needs 24GB VRAM), Llama 3.3 70B for general tasks (needs 40GB at Q4), and DeepSeek R1 distilled variants for reasoning. For 8GB VRAM, Gemma 3 4B and Qwen2.5 7B are the best small models.
**Are open source models as good as proprietary models?**
For most tasks, yes. The top open source models in 2026 trail proprietary leaders by only 3–8 Quality Index points. On math benchmarks, DeepSeek R1 actually surpasses many proprietary alternatives. The main gaps remain in instruction-following polish, multimodal capability, and very long contexts.
**What hardware do I need to run open source models locally?**
Minimum recommendations: 8GB VRAM for 7B models (Gemma 3 4B, Qwen2.5 7B), 24GB VRAM for 32B models (Qwen2.5-Coder 32B, DeepSeek Coder V2 16B), and 40GB+ for 70B models at 4-bit quantization. Apple Silicon (M-series) is excellent: 16GB unified memory handles 7B models comfortably, and 64GB handles 32B models well.
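Those numbers fall out of simple arithmetic: parameter count times bytes per weight, plus runtime overhead for the KV cache and buffers. A rough sketch; the 20% overhead factor is an assumption, and real usage varies with context length and backend:

```python
# Rough VRAM estimate: weights footprint plus runtime/KV-cache overhead.
# The 20% overhead factor is a ballpark assumption.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def vram_gb(params_billion: float, quant: str, overhead: float = 0.20) -> float:
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1 + overhead)

for quant in ("fp16", "q8", "q4"):
    print(f"Llama 3.3 70B @ {quant}: ~{vram_gb(70, quant):.0f} GB")
# q4 -> ~42 GB, consistent with the ~40GB figure quoted above
```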
**What is the best open source model for coding?**
GLM-4.7 (Thinking) leads open source coding models with 89% on LiveCodeBench. For Ollama users, Qwen2.5-Coder 32B is the best local option. For cheap API access, DeepSeek V3.2 at $0.35/M tokens delivers 86% on LiveCodeBench, matching Claude 3.5 Sonnet. See the full coding LLM ranking for more detail.
Data sources: Rankings based on the Artificial Analysis Intelligence Index.