The state of play
Three months ago, the best open source model scored 58 on our quality index, against 70 for the best proprietary model. Today, GLM-4.7 hits 68. That's not incremental improvement. That's a shift in what open weights can deliver.
The proprietary leaders haven't stood still. Gemini 3 Pro Preview and GPT-5.2 both score 73, setting new highs. But the distance between open and closed has compressed from 12 points in early 2025 to just 5 points now.
More interesting than the overall scores is where open source models now win specific benchmarks outright. GLM-4.7's 96% on τ²-Bench beats every proprietary model, including Claude Opus 4.5 at 90%. MiMo-V2-Flash's 96% on AIME 2025 ties Gemini 3 Pro.
Top 5 open source
Ranked by quality index, January 2026
1. GLM-4.7
2. Kimi K2 Thinking
3. MiMo-V2-Flash
4. DeepSeek V3.2
5. MiniMax-M2.1
Top 5 proprietary
Ranked by quality index, January 2026
1. Gemini 3 Pro Preview
2. GPT-5.2
3. Gemini 3 Flash
4. Claude Opus 4.5
5. GPT-5.1
Head to head: benchmark breakdown
| Benchmark | Best open source | Best proprietary | Gap (prop. − open) | Winner |
|---|---|---|---|---|
| AIME 2025 (math) | 96% MiMo-V2-Flash | 99% GPT-5.2 | +3 | Close |
| LiveCodeBench (code) | 89% GLM-4.7 | 92% Gemini 3 Pro | +3 | Close |
| GPQA Diamond (reasoning) | 86% GLM-4.7 | 91% Gemini 3 Pro | +5 | Prop. |
| MMLU-Pro (knowledge) | 88% MiniMax-M2.1 | 90% Gemini 3 Pro | +2 | Close |
| τ²-Bench (agentic) | 96% GLM-4.7 | 90% Claude Opus 4.5 | −6 | Open |
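To make the convention explicit: the gap is the best proprietary score minus the best open source score, so positive values favor proprietary. Here's a minimal Python sketch that reproduces the gap and winner columns, assuming a ±3-point cutoff for "Close" (our inference from the table, not an official definition):

```python
# Reproduces the Gap and Winner columns above. Gap is the best
# proprietary score minus the best open source score, so positive
# values favor proprietary. The +/-3-point "Close" cutoff is
# inferred from the table, not an official definition.

BENCHMARKS = {
    # name: (best open source %, best proprietary %)
    "AIME 2025":     (96, 99),
    "LiveCodeBench": (89, 92),
    "GPQA Diamond":  (86, 91),
    "MMLU-Pro":      (88, 90),
    "τ²-Bench":      (96, 90),
}

CLOSE_CUTOFF = 3  # points

def gap_and_winner(open_score: int, prop_score: int) -> tuple[int, str]:
    gap = prop_score - open_score
    if abs(gap) <= CLOSE_CUTOFF:
        return gap, "Close"
    return gap, "Prop." if gap > 0 else "Open"

for name, (o, p) in BENCHMARKS.items():
    gap, winner = gap_and_winner(o, p)
    print(f"{name:14} {gap:+d}  {winner}")
```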
The cost picture
Performance is one axis. Cost is the other. And here the story is unambiguous: open source models through inference providers like DeepInfra, Together, and Fireworks deliver comparable quality at a fraction of the price.
[Chart: blended price per million tokens, open source (via providers) vs. proprietary]
Moving from GPT-5.1 at $3.50/M tokens to DeepSeek V3.2 at $0.30/M cuts inference cost by roughly 91%, with a quality difference of just 4 points (70 vs. 66). For many production workloads, that trade is straightforward.
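As a sanity check on that arithmetic, here is the same trade computed explicitly; the 500M tokens/month volume is a hypothetical workload for illustration, not a figure from our data:

```python
# The cost/quality trade from the paragraph above, computed explicitly.
# Prices and quality scores are from the text; the monthly volume is a
# hypothetical workload for illustration.

gpt_5_1  = {"price_per_m": 3.50, "quality": 70}
deepseek = {"price_per_m": 0.30, "quality": 66}

reduction = 1 - deepseek["price_per_m"] / gpt_5_1["price_per_m"]
quality_delta = gpt_5_1["quality"] - deepseek["quality"]

monthly_tokens_m = 500  # hypothetical volume, in millions of tokens
monthly_savings = monthly_tokens_m * (
    gpt_5_1["price_per_m"] - deepseek["price_per_m"]
)

print(f"cost reduction: {reduction:.0%}")          # 91%
print(f"quality delta:  {quality_delta} points")   # 4 points
print(f"monthly saved:  ${monthly_savings:,.0f}")  # $1,600
```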
What this means for your stack
Use open source when
- Agentic workflows (GLM-4.7 leads τ²-Bench)
- Cost-sensitive at scale (10-50x cheaper)
- Self-hosting requirements
- Chinese market or localization
- Fine-tuning on proprietary data
Use proprietary when
- Maximum quality is non-negotiable
- 1M token context required (Gemini)
- Enterprise compliance needs
- Complex reasoning (GPQA Diamond edge)
- API stability and support SLAs
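Read together, the two checklists amount to a simple routing rule. Here's a minimal sketch; the constraint flags and default model picks are illustrative assumptions, not output from our tracker:

```python
from dataclasses import dataclass

# Hypothetical constraint flags; the field names and model picks are
# illustrative assumptions, not a recommendation engine we ship.

@dataclass
class Workload:
    agentic: bool = False
    cost_sensitive: bool = False
    needs_1m_context: bool = False
    needs_compliance_slas: bool = False
    max_quality: bool = False

def pick_tier(w: Workload) -> str:
    # Hard requirements that only the proprietary tier meets today
    if w.needs_1m_context or w.needs_compliance_slas or w.max_quality:
        return "proprietary (e.g. Gemini 3 Pro, GPT-5.2)"
    # Open weights win outright on agentic benchmarks and on price
    if w.agentic or w.cost_sensitive:
        return "open source (e.g. GLM-4.7 via an inference provider)"
    return "either: benchmark on your own evals"

print(pick_tier(Workload(agentic=True, cost_sensitive=True)))
# -> open source (e.g. GLM-4.7 via an inference provider)
```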
The bottom line
The question is no longer whether open source models can compete with proprietary ones. They can. GLM-4.7 and Kimi K2 Thinking are production-ready alternatives that match or beat closed models on specific benchmarks.
The question now is which trade-offs matter for your use case. A 5-point quality gap might be irrelevant if you're saving 90% on inference. Or it might be everything if you're building a product where reasoning accuracy is the moat.
We're tracking 94 models across providers. The data changes weekly. Check our explorer for live benchmarks and pricing.
Methodology
Quality Index is derived from Artificial Analysis benchmarks, normalized to a 0-100 scale. Benchmark scores are sourced from official model releases and third-party evaluations (AIME 2025, LiveCodeBench, GPQA Diamond, MMLU-Pro, τ²-Bench). Pricing reflects blended input/output rates as of January 2026. Open source includes open-weights models available through major inference providers.
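For reference, here's how a blended rate is typically computed from separate input and output prices; the 3:1 input:output token ratio below is a common convention and an assumption on our part, not necessarily the weighting behind every figure in this article:

```python
# How a blended $/M rate is typically derived from separate input and
# output prices. The 3:1 input:output token ratio is an assumption
# (a common blending convention); the exact weighting behind the
# prices in this article may differ.

def blended_rate(input_per_m: float, output_per_m: float,
                 input_share: float = 0.75) -> float:
    """Weighted average price per million tokens."""
    return input_per_m * input_share + output_per_m * (1 - input_share)

# Hypothetical prices, for illustration only:
print(blended_rate(input_per_m=1.00, output_per_m=5.00))  # -> 2.0
```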