Data analysis · January 2, 2026

Open source vs proprietary
The January 2026 data

We compared 94 model endpoints. The gap between the best open source and the best proprietary models has shrunk to 5 quality index points. Here's what the benchmarks actually show.

By Dylan Bristot · 12 min read

TL;DR

Open source top 5

  1. GLM-4.7 (Z AI): QI 68, agentic leader
  2. Kimi K2 Thinking (Moonshot): QI 67
  3. MiMo-V2-Flash (Xiaomi): QI 66, best math
  4. DeepSeek V3.2: QI 66, price/perf king
  5. MiniMax-M2.1: QI 64

Proprietary top 5

  1. Gemini 3 Pro Preview (Google): QI 73
  2. GPT-5.2 (OpenAI): QI 73, best AIME 2025 at 99%
  3. Gemini 3 Flash (Google): QI 71
  4. Claude Opus 4.5 (Anthropic): QI 70
  5. GPT-5.1 (OpenAI): QI 70

The gap: 5 points (was 12 in early 2025). Cost savings: ~85% at similar quality. Open source wins: τ²-Bench agentic (GLM-4.7 beats all at 96%).

  • Open source top QI: 68 (GLM-4.7)
  • Proprietary top QI: 73 (Gemini 3 Pro)
  • Gap closed since October: 2 points (7 → 5)
  • Cost savings at similar quality: ~85%

The state of play

Three months ago, the best open source model scored 58 on our quality index. Today, GLM-4.7 hits 68. That's not incremental improvement. That's a shift in what open weights can deliver.

The proprietary leaders haven't stood still. Gemini 3 Pro Preview and GPT-5.2 both score 73, setting new highs. But the distance between open and closed has compressed from 12 points in early 2025 to just 5 points now.

More interesting than the overall scores: where open source models are winning specific benchmarks outright. GLM-4.7's 96% on τ²-Bench beats every proprietary model including Claude Opus 4.5 at 90%. MiMo-V2-Flash's 96% on AIME 2025 ties Gemini 3 Pro.

Top 5 open source

Ranked by quality index, January 2026

  1. GLM-4.7 (Z AI, 200K context) · Quality Index 68
     Agentic leader with 96% τ²-Bench
     AIME 2025: 95% · LiveCodeBench: 89% · GPQA Diamond: 86% · MMLU-Pro: 86% · τ²-Bench: 96%

  2. Kimi K2 Thinking (Moonshot AI, 256K context) · Quality Index 67
     1T MoE with 32B active params
     AIME 2025: 95% · LiveCodeBench: 85% · GPQA Diamond: 84% · MMLU-Pro: 85% · τ²-Bench: 93%

  3. MiMo-V2-Flash (Xiaomi, 256K context) · Quality Index 66
     Best math performance in open source
     AIME 2025: 96% · LiveCodeBench: 87% · GPQA Diamond: 85% · MMLU-Pro: 84% · τ²-Bench: 95%

  4. DeepSeek V3.2 (DeepSeek, 128K context) · Quality Index 66
     Price/performance king
     AIME 2025: 92% · LiveCodeBench: 86% · GPQA Diamond: 84% · MMLU-Pro: 86% · τ²-Bench: 91%

  5. MiniMax-M2.1 (MiniMax, 205K context) · Quality Index 64
     Strong MMLU-Pro at 88%
     AIME 2025: 83% · LiveCodeBench: 81% · GPQA Diamond: 83% · MMLU-Pro: 88% · τ²-Bench: 85%

Top 5 proprietary

Ranked by quality index, January 2026

  1. Gemini 3 Pro Preview (Google, 1M context) · Quality Index 73
     Highest GPQA Diamond at 91%
     AIME 2025: 96% · LiveCodeBench: 92% · GPQA Diamond: 91% · MMLU-Pro: 90% · τ²-Bench: 87%

  2. GPT-5.2 (OpenAI, 400K context) · Quality Index 73
     Best AIME 2025 at 99%
     AIME 2025: 99% · LiveCodeBench: 89% · GPQA Diamond: 90% · MMLU-Pro: 87% · τ²-Bench: 85%

  3. Gemini 3 Flash (Google, 1M context) · Quality Index 71
     Speed + quality sweet spot
     AIME 2025: 97% · LiveCodeBench: 91% · GPQA Diamond: 90% · MMLU-Pro: 89% · τ²-Bench: 80%

  4. Claude Opus 4.5 (Anthropic, 200K context) · Quality Index 70
     Best τ²-Bench among proprietary
     AIME 2025: 91% · LiveCodeBench: 87% · GPQA Diamond: 87% · MMLU-Pro: 90% · τ²-Bench: 90%

  5. GPT-5.1 (OpenAI, 400K context) · Quality Index 70
     Balanced across all benchmarks
     AIME 2025: 94% · LiveCodeBench: 87% · GPQA Diamond: 87% · MMLU-Pro: 87% · τ²-Bench: 82%

Head to head: benchmark breakdown

Benchmark                  Best open source        Best proprietary        Gap   Winner
AIME 2025 (math)           96% (MiMo-V2-Flash)     99% (GPT-5.2)           +3    Close
LiveCodeBench (code)       89% (GLM-4.7)           92% (Gemini 3 Pro)      +3    Close
GPQA Diamond (reasoning)   86% (GLM-4.7)           91% (Gemini 3 Pro)      +5    Proprietary
MMLU-Pro (knowledge)       88% (MiniMax-M2.1)      90% (Gemini 3 Pro)      +2    Close
τ²-Bench (agentic)         96% (GLM-4.7)           90% (Claude Opus 4.5)   -6    Open source

  • Open source leads outright on 1 benchmark: τ²-Bench, where GLM-4.7 beats every proprietary model
  • 3 benchmarks are within 3 points: math, code, knowledge
  • Largest proprietary leads: GPQA Diamond (+5) and AIME 2025 (+3)
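
If you want to sanity-check these gaps yourself, the table above falls out of the per-benchmark best scores in a few lines of Python. The numbers are the January 2026 figures quoted in this post; the dictionaries are just an illustrative way to hold them.

```python
# Recompute the head-to-head gaps from the best scores quoted above.
# Positive gap = proprietary ahead, negative = open source ahead.

open_best = {
    "AIME 2025":     ("MiMo-V2-Flash", 96),
    "LiveCodeBench": ("GLM-4.7", 89),
    "GPQA Diamond":  ("GLM-4.7", 86),
    "MMLU-Pro":      ("MiniMax-M2.1", 88),
    "τ²-Bench":      ("GLM-4.7", 96),
}

prop_best = {
    "AIME 2025":     ("GPT-5.2", 99),
    "LiveCodeBench": ("Gemini 3 Pro", 92),
    "GPQA Diamond":  ("Gemini 3 Pro", 91),
    "MMLU-Pro":      ("Gemini 3 Pro", 90),
    "τ²-Bench":      ("Claude Opus 4.5", 90),
}

for bench, (open_model, open_score) in open_best.items():
    prop_model, prop_score = prop_best[bench]
    gap = prop_score - open_score
    leader = prop_model if gap > 0 else open_model
    print(f"{bench:14}  gap {gap:+d}  leader: {leader}")
```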

The cost picture

Performance is one axis. Cost is the other. And here the story is unambiguous: open source models through inference providers like DeepInfra, Together, and Fireworks deliver comparable quality at a fraction of the price.

Open source (via providers)

  • Qwen3 235B (via Fireworks): $0.10/M
  • DeepSeek V3.2 (via DeepInfra): $0.30/M
  • GLM-4.7 (via Z AI): $0.18/M
  • Kimi K2 (via Moonshot): $0.60/M
  • MiMo-V2-Flash (via Xiaomi): $0.15/M

Proprietary

  • Gemini 3 Pro (Google): $4.50/M
  • GPT-5.2 (OpenAI): $5.00/M
  • Claude Opus 4.5 (Anthropic): $30.00/M
  • GPT-5.1 (OpenAI): $3.50/M
  • Gemini 3 Flash (Google): $0.40/M

Moving from GPT-5.1 at $3.50/M to DeepSeek V3.2 at $0.30/M cuts inference cost by more than 85% (roughly 91%), with a quality difference of just 4 points (70 vs 66). For many production workloads, that trade is straightforward.
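
The arithmetic behind that claim is simple enough to write down. A minimal sketch, using the blended prices and quality index values listed above:

```python
# Cost reduction and quality delta for swapping one model for another.
# Prices are the blended $/M token figures from the lists above; QI
# values are this post's quality index scores.

def cost_reduction(price_from: float, price_to: float) -> float:
    """Percentage saved when switching from price_from to price_to."""
    return (price_from - price_to) / price_from * 100

# GPT-5.1 ($3.50/M, QI 70) -> DeepSeek V3.2 ($0.30/M, QI 66)
print(f"{cost_reduction(3.50, 0.30):.0f}% cheaper")   # ~91% cheaper
print(f"{70 - 66} QI points lower")                   # 4 QI points lower

# Claude Opus 4.5 ($30.00/M, QI 70) -> GLM-4.7 ($0.18/M, QI 68)
print(f"{cost_reduction(30.00, 0.18):.1f}% cheaper")  # ~99.4% cheaper
```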

What this means for your stack

Use open source when

  • Agentic workflows (GLM-4.7 leads τ²-Bench)
  • Cost-sensitive at scale (10-50x cheaper)
  • Self-hosting requirements
  • Chinese market or localization
  • Fine-tuning on proprietary data

Use proprietary when

  • Maximum quality is non-negotiable
  • 1M token context required (Gemini)
  • Enterprise compliance needs
  • Complex reasoning (GPQA Diamond edge)
  • API stability and support SLAs
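
As a concrete (and deliberately simplified) version of those checklists, a routing layer can encode the same decisions in a few lines. The task labels and model names below are illustrative placeholders, not actual provider API identifiers.

```python
# Minimal sketch of workload-based model routing, following the
# checklists above. Labels and model names are illustrative only;
# substitute the actual endpoints your providers expose.

def pick_model(task: str, *, long_context: bool = False,
               strict_compliance: bool = False) -> str:
    if strict_compliance:
        return "GPT-5.2"          # proprietary: support SLAs, compliance
    if long_context:
        return "Gemini 3 Pro"     # proprietary: 1M token context
    if task == "agentic":
        return "GLM-4.7"          # open source: leads τ²-Bench
    if task in {"bulk", "summarization", "extraction"}:
        return "DeepSeek V3.2"    # open source: best price/performance
    return "Gemini 3 Pro"         # default to the top quality index

print(pick_model("agentic"))                  # GLM-4.7
print(pick_model("qa", long_context=True))    # Gemini 3 Pro
```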

The bottom line

The question is no longer whether open source models can compete with proprietary. They can. GLM-4.7 and Kimi K2 Thinking are production-ready alternatives that match or beat closed models on specific benchmarks.

The question now is which trade-offs matter for your use case. A 5-point quality gap might be irrelevant if you're saving 85% on inference. Or it might be everything if you're building a product where reasoning accuracy is the moat.

We're tracking 94 models across providers. The data changes weekly. Check our explorer for live benchmarks and pricing.

Methodology

Quality Index is derived from Artificial Analysis benchmarks, normalized to a 0-100 scale. Benchmark scores are sourced from official model releases and third-party evaluations (AIME 2025, LiveCodeBench, GPQA Diamond, MMLU-Pro, τ²-Bench). Pricing reflects blended input/output rates as of January 2026. Open source includes open-weights models available through major inference providers.
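
The Artificial Analysis weighting is not reproduced in this post, so as a purely illustrative sketch of what "normalized to a 0-100 scale" can mean, here is a composite benchmark average followed by a min-max rescaling. Treat this as an assumed stand-in for the shape of the computation, not the actual index formula.

```python
# Illustrative only: average the five benchmark scores per model, then
# min-max rescale the averages to 0-100. The real Artificial Analysis
# quality index uses its own weighting and normalization, so this will
# not reproduce the QI numbers quoted in this post.

def quality_index(raw_scores: dict[str, list[float]]) -> dict[str, float]:
    averages = {m: sum(s) / len(s) for m, s in raw_scores.items()}
    lo, hi = min(averages.values()), max(averages.values())
    return {m: round(100 * (a - lo) / (hi - lo), 1) for m, a in averages.items()}

scores = {  # AIME, LiveCodeBench, GPQA, MMLU-Pro, τ²-Bench percentages
    "GLM-4.7":      [95, 89, 86, 86, 96],
    "GPT-5.2":      [99, 89, 90, 87, 85],
    "MiniMax-M2.1": [83, 81, 83, 88, 85],
}
print(quality_index(scores))
```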