Self-host, fine-tune, and deploy without restrictions. Open source models now hit 90% on LiveCodeBench and 97% on AIME 2025 - rivaling the best proprietary models.
MoonshotAI - 1T parameters, 256K context
Debuts at #1 with a Quality Index of 46.77. Scores 96% on AIME 2025 and 85% on LiveCodeBench. Open-weight and commercially usable.
| Rank | Model | Lab | Quality | LiveCodeBench | AIME 2025 | MMLU-Pro |
|---|---|---|---|---|---|---|
| 1 | Kimi K2.5 (Reasoning) | Kimi | 46.77 | 85% | 96% | - |
| 2 | GLM-4.7 (Thinking) | Z AI | 41.7 | 89% | 95% | 86% |
| 3 | DeepSeek V3.2 | DeepSeek | 41.2 | 86% | 92% | 86% |
| 4 | Kimi K2 Thinking | Kimi | 40.3 | 85% | 95% | 85% |
| 5 | MiniMax-M2.1 | MiniMax | 39.3 | 81% | 83% | 88% |
| 6 | MiMo-V2-Flash | Xiaomi | 39.0 | 87% | 96% | 84% |
| 7 | Llama Nemotron Ultra | NVIDIA | 38.0 | 64% | 64% | 83% |
| 8 | MiniMax-M2 | MiniMax | 35.7 | 83% | 78% | 82% |
| 9 | DeepSeek V3.2 Speciale | DeepSeek | 34.1 | 90% | 97% | 86% |
| 10 | DeepSeek V3.1 Terminus | DeepSeek | 33.4 | 80% | 90% | 85% |
| 11 | gpt-oss-120B (high) | OpenAI | 32.9 | 88% | 93% | 81% |
| 12 | GLM-4.6 | Z AI | 32.2 | 56% | 44% | 78% |
Ranked by Quality Index, with LiveCodeBench, AIME 2025, and MMLU-Pro shown for context. These models are free to self-host and fine-tune for your codebase.
Not every model can run on consumer hardware. Here are the best open source models organized by VRAM requirements, perfect for Ollama, vLLM, or Text Generation Inference.
Run on a single consumer GPU (RTX 4060–4090) or even Apple Silicon Macs.
Needs an A100/H100 or a multi-GPU consumer setup. The sweet spot for quality vs. cost.
Frontier quality. Needs multi-GPU or cloud. MoE models use only a fraction of total params per token.
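The tiers above can be sanity-checked with a back-of-the-envelope VRAM formula. The bytes-per-parameter and overhead figures below are rough assumptions, not measurements, and note that MoE models still need their *total* parameters in memory even though each token only activates a fraction of them:

```python
def estimate_vram_gb(params_billions, bytes_per_param=2.0, overhead=1.2):
    """Crude VRAM estimate: weight memory plus ~20% overhead for the
    KV cache and activations. A rule of thumb, not a guarantee."""
    return params_billions * bytes_per_param * overhead

# A dense 13B model: FP16 (~2 bytes/param) vs. 4-bit GGUF (~0.5 bytes/param).
fp16_gb = estimate_vram_gb(13, bytes_per_param=2.0)  # ~31 GB: datacenter GPU
q4_gb = estimate_vram_gb(13, bytes_per_param=0.5)    # ~8 GB: fits a consumer GPU
```

The same arithmetic explains why 4-bit quantization moves mid-size models from the datacenter tier onto a single consumer card.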
Keep all data on your infrastructure. No API calls to third parties. Critical for healthcare, legal, and enterprise applications.
Fine-tune for your specific use case. Modify behavior, remove guardrails, or train on proprietary data. No terms of service limitations.
At high volumes, self-hosting becomes dramatically cheaper. No per-token fees - just infrastructure costs.
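As a rough illustration of where that crossover sits (every price below is an assumption for the sketch, not a quote):

```python
def api_cost(tokens_m, price_per_m=0.50):
    """Hosted-API spend at an assumed $/M-token rate."""
    return tokens_m * price_per_m

def self_host_cost(gpu_hours, rate_per_hour=2.0):
    """Self-hosting is roughly flat: rented (or amortized) GPU time,
    independent of how many tokens you push through it."""
    return gpu_hours * rate_per_hour

# One always-on H100-class GPU: ~730 hours/month at an assumed $2/hour.
flat = self_host_cost(730)       # $1,460/month regardless of volume
break_even_m = flat / 0.50       # ~2,920M tokens/month at $0.50/M tokens
```

Below the break-even volume the per-token API wins; above it, the flat self-hosting bill does.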
February verdict: The gap is now single-digit Quality Index points for most practical tasks. Open source wins on cost and privacy; proprietary wins on multimodal and convenience. Read our full analysis →
Use hosted APIs for open models. Get the benefits of open source with the ease of SaaS.
Nebius Token Factory - High-throughput inference on H100/H200 clusters, great pricing
Together.ai - Wide selection, competitive pricing
Fireworks.ai - Fast inference, great for production
Groq - Fastest inference available, ideal for real-time apps
DeepInfra - Cost-effective, good for batch workloads
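Most of these providers expose an OpenAI-compatible `/chat/completions` endpoint, so switching between them is often just a base-URL and API-key change. A minimal sketch of the request body they accept (the model slug here is illustrative - check your provider's model list):

```python
import json

def chat_payload(model, prompt, temperature=0.2):
    """Build the JSON body for an OpenAI-compatible /chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

# POST json.dumps(body) to <provider base URL>/chat/completions
# with an "Authorization: Bearer <key>" header.
body = chat_payload("deepseek-ai/DeepSeek-V3.2", "Summarize MoE routing.")
```

Because the shape is shared, the same client code can be pointed at Together.ai, Fireworks.ai, Groq, or DeepInfra to compare latency and price.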
Run models on your own infrastructure for maximum control and privacy.
Ollama - Easiest local setup, great for development and prototyping
vLLM - Production-grade, excellent throughput and batching
Text Generation Inference - HuggingFace's production server
llama.cpp - CPU-friendly, works on older hardware
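As one concrete example of the self-hosted path, Ollama serves a local HTTP API on port 11434 once `ollama serve` is running. A minimal non-streaming request can be built with the standard library alone (the model name is illustrative - use whatever you have pulled locally):

```python
import json
import urllib.request

def ollama_request(prompt, model="llama3", host="http://localhost:11434"):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = ollama_request("Why is the sky blue?")
# urllib.request.urlopen(req) returns JSON with a "response" field when the
# local server is running; no network call is made here.
```

vLLM and Text Generation Inference instead expose the OpenAI-compatible schema, so production clients written against hosted APIs can usually be repointed at them unchanged.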
Use our interactive tools to compare benchmarks, pricing, and speed for all 12+ open source models side-by-side.
Kimi K2.5 (Reasoning) leads our February 2026 rankings with a Quality Index of 46.77, excelling at coding (85%) and reasoning (96%). It's completely free to download and use under an open license.
The biggest change is Kimi K2.5 (Reasoning) debuting at #1 with a Quality Index of 46.77, scoring 96% on AIME 2025 and 85% on LiveCodeBench. GLM-4.7 (Thinking) moves to #2, while DeepSeek V3.2 and MiniMax-M2.1 remain strong top-tier contenders. View January rankings →
By size tier: Small (7-13B) - Gemma 3 12B, Phi-4 for general tasks. Medium (30-70B) - Qwen3 30B A3B, EXAONE 4.0 32B, DeepSeek R1 Distill Llama 70B. Large (200B+ MoE) - Qwen3-235B, DeepSeek V3.2. Use GGUF quantization to cut memory needs 50–75%.
Yes, for most tasks. Kimi K2.5 (Reasoning) achieves 85% on LiveCodeBench, matching top proprietary models on coding. Kimi K2.5 scores 96% on AIME 2025, outperforming most proprietary alternatives on math. The gap has closed to single-digit Quality Index points for practical applications.
The best free/open source coding LLMs in February 2026: DeepSeek V3.2 Speciale (90% LiveCodeBench), GLM-4.7 (Thinking) (89%), gpt-oss-120B (high) (88%). All free to self-host. Via API, providers like Together.ai and Groq offer these at $0.20–0.80/M tokens.
Llama 4 Scout supports up to 10 million tokens. For practical high-quality use, Kimi K2.5 and MiMo-V2-Flash offer 256K tokens, and GLM-4.7 provides 200K - more than enough for most document processing tasks. See our long context rankings for the full comparison.
Yes. Kimi K2.5 (Reasoning) is released by MoonshotAI under an open license. It has 1 trillion parameters with a 256K context window and can be commercially deployed, fine-tuned, and self-hosted. It debuted in our February 2026 rankings at #1 with a Quality Index of 46.77.
Quality Index comes from the Artificial Analysis Intelligence Index v4.0 - a composite score evaluating overall model capability across reasoning, coding, math, and knowledge tasks. Rankings use QI as the primary factor, with category-specific benchmarks (LiveCodeBench, AIME 2025, MMLU-Pro) as tiebreakers.
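In pseudocode, that rule amounts to a two-key sort - Quality Index first, a category benchmark as the tiebreaker (scores below are taken from the table above, using LiveCodeBench as the tiebreak for illustration):

```python
# Rank by Quality Index first; break ties on a category benchmark.
models = [
    {"name": "GLM-4.7 (Thinking)", "qi": 41.7, "lcb": 89},
    {"name": "Kimi K2.5 (Reasoning)", "qi": 46.77, "lcb": 85},
    {"name": "DeepSeek V3.2", "qi": 41.2, "lcb": 86},
]
ranked = sorted(models, key=lambda m: (m["qi"], m["lcb"]), reverse=True)
# ranked[0] is Kimi K2.5 (Reasoning) on Quality Index alone; the
# LiveCodeBench key only matters when two models tie on QI.
```

This is why a model with a stronger single benchmark (like DeepSeek V3.2 Speciale's 90% LiveCodeBench) can still sit lower overall.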
LiveCodeBench rankings
🧮 AIME 2025 rankings
💰 Best value per dollar
🤖 Tool use rankings
📄 Largest context windows
👁️ Multimodal rankings
🏆 Expert picks
📊 Complete hub
Data sources: Rankings based on the Artificial Analysis Intelligence Index. Explore all models in our interactive explorer or compare models side-by-side.