Open source vs proprietary LLMs: complete 2025 benchmark analysis
TL;DR: The state of LLMs in late 2025
The landscape has shifted dramatically:
- Open source dominates by volume: 63% of models in our dataset (59 open source vs 35 proprietary)
- Performance gap closing fast: Best open source model (MiniMax-M2, quality 61) trails best proprietary (GPT-5.1 (High), quality 70) by just 9 points, down from 15-20 points in 2024
- Cost advantage is massive: Open source averages $0.83 per million tokens vs $6.03 for proprietary (86% savings, or 7.3x cheaper)
- Speed advantage: Open source models on optimized infrastructure average 179 tokens/sec vs 138 for proprietary, with peaks exceeding 3,000 tokens/sec
- Production-ready options: 9 open source models score 50+ quality (vs 19 proprietary), making them viable for most professional use cases
Key takeaways:
- Chinese labs (DeepSeek, Qwen, GLM) are disrupting with high-quality models at $0.25-$0.53/M tokens
- Proprietary still leads for absolute best performance (elite tier), but the window is narrowing
- For 80% of use cases, open source now offers better value without meaningful quality sacrifice
- At current pace, open source will achieve parity with today's best proprietary models by mid-2026
- Hybrid strategies (open source for volume, proprietary for critical tasks) are optimal for most organizations
I analyzed 94 leading large language models from the Artificial Analysis leaderboards, spanning 329 different API provider endpoints, to understand where we really stand in the open source versus proprietary debate. The results surprised me, and they'll likely change how you think about choosing an LLM for your next project.
The dataset: what we're working with
Before diving into the findings, let's establish what we're analyzing. This data comes from the Artificial Analysis LLM leaderboards, which track and benchmark the performance of leading AI models and their API providers:
- 94 unique LLM models from major labs including OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba (Qwen), Zhipu AI (GLM), and others
- 329 total API endpoints across different providers (the same model can be offered by multiple providers like AWS Bedrock, Azure, Groq, Fireworks, Together AI, etc., often at different prices and speeds)
- Comprehensive metrics including quality scores, pricing per million tokens, output speed (tokens/second), latency, context windows, and detailed benchmarks
- Recent data reflecting the state of LLMs in late October 2025
Important note: This analysis focuses on models tracked by Artificial Analysis, which emphasizes production-ready LLMs with API access. It doesn't include all open source models (many experimental or research-only models aren't represented), but it does cover the most widely-used and commercially-available options from both camps.
Each model is evaluated on a quality index based on rigorous benchmarks like GPQA Diamond (PhD-level reasoning), AIME 2025 (advanced mathematics), LiveCodeBench (coding ability), and MMLU-Pro (general knowledge). The scores in our dataset range from 0 to 70, with higher scores indicating better overall performance. Think of it like a GPA for AI models, where the top performers currently score around 70.
The big picture: open source dominates by volume
The first striking finding is that open source models now represent 62.8% of the dataset by model count (59 of the 94 models tracked).
This shift is dramatic. Just two years ago, proprietary models dominated the landscape. What changed? Chinese AI labs went into overdrive, releasing models like DeepSeek V3, Qwen3, and GLM-4 at an unprecedented pace. Meta's continued commitment to open source Llama models has also been pivotal. The barrier to entry for quality AI has essentially collapsed.
Quality analysis: the gap is closing fast
Here's where it gets interesting. While proprietary models still hold the lead in absolute quality, the gap is narrowing rapidly:
| Metric | Open Source | Proprietary |
|---|---|---|
| Average Quality | 31.9 | 48.0 |
| Median Quality | 29.0 | 51.0 |
| Quality Range | 0 - 61 | 11 - 70 |
| Top Performer | MiniMax-M2 (61) | GPT-5.1 (High) (70) |
The critical number here is 9 points—that's the gap between the best open source model (MiniMax-M2 at 61) and the new proprietary leader (GPT-5.1 (High) at 70). In October 2024, this gap was around 15-20 points. At the current rate of improvement, we're looking at parity by Q2 2026.
What "quality index" actually means
When I say a model has a quality index of 61 versus 68, what does that actually translate to in real-world use? Here's a practical breakdown:
- 60+ (elite tier): can handle PhD-level reasoning, solve advanced mathematics (like AIME competition problems), and write production-ready code with minimal errors
- 50-59 (high tier): excellent for most professional use cases including content generation, analysis, coding assistance, customer service
- 40-49 (medium tier): solid for everyday tasks but may struggle with highly complex reasoning or specialized domains
- 30-39 (low tier): good for basic tasks, simple Q&A, and straightforward content generation
- <30 (basic tier): experimental or highly specialized models, often useful for niche applications
Top performers: open source is competitive
The top 5 open source models are now genuinely impressive:
- MiniMax-M2 - Quality: 61
- GPT-OSS-120B - Quality: 58
- DeepSeek V3.1 Terminus - Quality: 58
- Qwen3 235B A22B - Quality: 57
- DeepSeek V3.2 Exp - Quality: 57
Meanwhile, the top proprietary models are:
- GPT-5.1 (high) - Quality: 70
- GPT-5 Codex (high) - Quality: 68
- GPT-5 (high) - Quality: 68
- GPT-5 (medium) - Quality: 66
- o3 - Quality: 65
The cost equation: open source is 7.3x cheaper
Here's where open source absolutely dominates. On pricing per million tokens (roughly 750,000 words of text), open source averages $0.83 versus $6.03 for proprietary, an 86% saving.
Real-world cost examples
Let's make this concrete. Say you're building a customer service chatbot that processes 10 million tokens per month (about 7.5 million words, equivalent to roughly 150 full-length novels):
- With Qwen3-235B (open source, quality 57): $2.50/month
- With Claude 4.5 Sonnet (proprietary, quality 63): $60/month
- With GPT-5 (proprietary, quality 68): $34.40/month
You're getting 84% of GPT-5's quality at 7% of the cost with Qwen3. That's the disruption we're seeing.
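If you want to sanity-check these numbers, the arithmetic is a one-liner. A minimal sketch, using the blended per-million-token prices quoted in this post (which vary by provider and change often, so treat them as illustrative):

```python
def monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Monthly API bill in dollars for a given token volume and price."""
    return tokens_per_month / 1_000_000 * price_per_million

MONTHLY_TOKENS = 10_000_000  # ~7.5 million words of text

# Blended prices per million tokens, as quoted above (illustrative).
prices = {
    "Qwen3-235B (open source)": 0.25,
    "Claude 4.5 Sonnet (proprietary)": 6.00,
    "GPT-5 (proprietary)": 3.44,
}

for model, price in prices.items():
    print(f"{model}: ${monthly_cost(MONTHLY_TOKENS, price):.2f}/month")
```

Scale `MONTHLY_TOKENS` to your own workload; the ratios between models stay the same.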
Value analysis: quality per dollar
This is where things get really interesting. I calculated a "value score" for each model by dividing its quality index by its average price per million tokens. Think of it as "how much AI intelligence am I getting for each dollar I spend?"
A score of 600 means you're getting 600 points of quality for every dollar. To put this in perspective:
- If a model costs $0.10 per million tokens and has a quality score of 60, its value score is 60 ÷ 0.10 = 600
- If a model costs $6.00 per million tokens and has a quality score of 63, its value score is 63 ÷ 6.00 = 10.5
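In code, the two examples above reduce to a single division (a minimal sketch of the value-score formula; free models are excluded, since their quality per dollar is unbounded):

```python
def value_score(quality: float, price_per_million: float) -> float:
    """Quality points per dollar of inference; higher is better.
    Not defined for free models (division by zero)."""
    return quality / price_per_million

print(round(value_score(60, 0.10)))     # 600
print(round(value_score(63, 6.00), 1))  # 10.5
```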
The higher the value score, the better bang for your buck. Here are the champions:
Best value open source models
- Phi-4 Mini / Phi-4 Multimodal - effectively infinite (they're free, so quality per dollar is unbounded)
- Gemma 3 4B - 600 quality per dollar
- NVIDIA Nemotron Nano 9B V2 - 529 quality per dollar
- Qwen3-235B A22B - 228 quality per dollar
- DeepSeek V3.1 Terminus - 129 quality per dollar
Best value proprietary models
- GPT-5 nano (high) - 364 quality per dollar
- Nova Micro - 300 quality per dollar
- Gemini 2.5 Flash-Lite - 282 quality per dollar
Notice something? Even the best value proprietary models can't compete with free (Phi-4) or ultra-cheap open source options. This is why open source is becoming the default choice for most use cases.
Speed matters: open source actually wins
One area where I expected proprietary to dominate was speed. I was wrong.
| Metric | Open Source | Proprietary |
|---|---|---|
| Average Speed | 179 tokens/sec | 138 tokens/sec |
| Median Speed | 88 tokens/sec | 115 tokens/sec |
| Maximum Speed | 3,087 tokens/sec | 616 tokens/sec |
Open source models running on optimized infrastructure (providers like Groq, Fireworks AI, Together AI, and Nebius Token Factory) can achieve 5x faster speeds than the fastest proprietary options. For real-time applications like chatbots, autocomplete, or interactive assistants, this is game-changing.
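To see why throughput matters, here's a rough sketch of how long a typical reply takes to stream at the speeds above. It ignores time-to-first-token (which also varies by provider), and the 500-token reply length is an assumption for illustration:

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time to stream a full response, ignoring time-to-first-token."""
    return output_tokens / tokens_per_sec

RESPONSE_TOKENS = 500  # a typical chatbot reply (assumed)

# Throughput figures from the table above.
for label, tps in [("open source average", 179),
                   ("proprietary average", 138),
                   ("open source peak", 3087)]:
    print(f"{label}: {generation_seconds(RESPONSE_TOKENS, tps):.2f}s")
```

At peak open source speeds, a full reply streams in well under a quarter of a second, which is the difference between an assistant that feels instant and one that visibly types.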
Context windows: parity achieved
Context window (how much text a model can "remember" at once) used to be a proprietary advantage. Not anymore.
- Open source average: 412,000 tokens (roughly 300,000 words, or about six full-length novels)
- Proprietary average: 468,000 tokens (roughly 350,000 words, or about seven full-length novels)
The largest context windows:
- Open source: Llama 4 Scout (10 million tokens), MiniMax-Text-01 (4 million), MiniMax M1 40k (1 million)
- Proprietary: Grok 4 Fast (2 million), Claude 4.5 Sonnet (1 million), Gemini 2.5 Pro (1 million)
Context window is no longer a differentiator. Open source caught up and, in some cases, surpassed proprietary offerings.
Quality tier distribution: where each camp excels
When we break down models by quality tier, a clear pattern emerges:
| Quality Tier | Open Source | Proprietary | Winner |
|---|---|---|---|
| Elite (60+) | 1 model | 11 models | 🔒 Proprietary |
| High (50-59) | 8 models | 8 models | 🤝 Tied |
| Medium (40-49) | 11 models | 7 models | 🔓 Open Source |
| Low (30-39) | 9 models | 3 models | 🔓 Open Source |
| Basic (<30) | 30 models | 6 models | 🔓 Open Source |
The pattern: Proprietary dominates at the very top (elite tier), but open source floods the market at every other level, especially in the "production-ready" high tier where they're tied 8-8.
Head-to-head: the top 10 models overall
When we rank all models by quality regardless of license, here's the top 10:
| Rank | Model | Type | Quality | Price/M | Value |
|---|---|---|---|---|---|
| 1 | GPT-5.1 (high) | 🔒 Proprietary | 70 | $3.44 | 20 |
| 2 | GPT-5 Codex (high) | 🔒 Proprietary | 68 | $3.44 | 20 |
| 3 | GPT-5 (high) | 🔒 Proprietary | 68 | $3.44 | 20 |
| 4 | GPT-5 (medium) | 🔒 Proprietary | 66 | $3.44 | 19 |
| 5 | o3 | 🔒 Proprietary | 65 | $3.50 | 19 |
| 6 | Grok 4 | 🔒 Proprietary | 65 | $8.50 | 8 |
| 7 | GPT-5 mini (high) | 🔒 Proprietary | 64 | $0.69 | 93 |
| 8 | Claude 4.5 Sonnet | 🔒 Proprietary | 63 | $6.00 | 11 |
| 9 | GPT-5 (low) | 🔒 Proprietary | 62 | $3.44 | 18 |
| 10 | MiniMax-M2 | 🔓 Open Source | 61 | $0.53 | 115 |
The breakthrough: MiniMax-M2 is the only open source model in the top 10, but it offers 5.6x better value than comparable proprietary options at the same quality level. That's the inflection point we're witnessing.
Strategic insights: what this all means
1. The quality gap is narrowing at record speed
In October 2024, the gap between the best open source and proprietary models was 15-20 quality points. Today it's just 9 points. At this rate, we'll see parity by mid-2026. Chinese labs like DeepSeek and Alibaba (Qwen) are iterating faster than anyone anticipated.
2. Cost efficiency makes open source the default choice
With 86% average cost savings, open source has become the economically rational choice for most use cases. Even if you need the absolute best quality, it's hard to justify paying $6/M for a 63-quality proprietary model when you can get a 57-quality open source model for $0.35/M. That's 17x cheaper for 90% of the capability.
3. Speed advantage is underrated
The fact that open source models on optimized infrastructure can hit 3,000+ tokens per second (versus 600 for proprietary) is a massive advantage for latency-sensitive applications. If you're building a real-time chatbot, autocomplete feature, or interactive assistant, open source on Groq or Fireworks is the clear winner.
4. Chinese labs are disrupting the market
Models like DeepSeek V3.1 (quality 58, price $0.45), Qwen3-235B (quality 57, price $0.25), and GLM-4.6 (quality 56, price $0.88) represent the "iPhone moment" for LLMs: high quality made accessible. This is forcing Western labs to compete on price and accelerating innovation across the board.
5. Proprietary still dominates elite use cases
If you absolutely need the best, for tasks like solving competition-level math problems (GPT-5.1 (High) hits 94% on AIME 2025 and 87% on LiveCodeBench), cutting-edge reasoning, or mission-critical production code, proprietary still has the edge. But that edge is shrinking fast, and it comes at a significant cost premium.
6. The "sweet spot" has shifted dramatically
In 2024, the best value models were proprietary "lite" versions like GPT-4o-mini and Claude Haiku (quality around 40-45, price around $0.50-2.00/M). In 2025, the sweet spot is firmly in open source territory: Qwen3-235B, DeepSeek V3.2, and Llama 3.3 70B offer quality scores of 50-57 at prices of $0.17-0.42/M. That's a quantum leap in value.
Recommendations by use case
Based on this analysis, here's what I recommend for different scenarios:
🤖 Production Chatbot / Customer Service
Recommended: Qwen3-235B ($0.25/M, Quality: 57)
Why: Excellent quality-to-price ratio, 256K context window for long conversations, fast inference
💻 Code Generation / Coding Assistant
Recommended: DeepSeek V3.2 Exp ($0.35/M, Quality: 57)
Why: 83% on LiveCodeBench, optimized for coding tasks, extremely cheap for the quality
🎯 Maximum Quality (Mission-Critical)
Recommended: GPT-5.1 (High) (Quality: 70) or GPT-5 Codex (Quality: 68)
Why: Still unmatched at the absolute top for complex reasoning, advanced coding, and specialized tasks
🆓 Free Tier / Experimentation
Recommended: Phi-4 Mini, Phi-4 Multimodal, or Gemma 3
Why: Completely free, solid quality for basic tasks, perfect for prototyping
⚡ Speed-Critical / Real-Time Applications
Recommended: Llama 3.3 70B on Groq (250+ tokens/sec, Quality: 48)
Why: Blazing fast inference, good quality, low latency
📚 Long Context / Document Analysis
Recommended: MiniMax-M2 (205K context, Quality: 61, $0.53/M)
Why: Best open source elite model with massive context window at reasonable price
💰 Budget-Conscious / High Volume
Recommended: DeepSeek or Qwen3 family (various sizes)
Why: 86% cheaper than proprietary alternatives with comparable quality
Looking ahead: 2026 predictions
Based on current trajectories, here's what I expect to see in 2026:
- Quality parity: open source will match or exceed the current GPT-5.1 quality level (70) by Q2 2026. DeepSeek V4, Llama 5, or Qwen4 are likely candidates to hit this milestone.
- Proprietary pivot: proprietary labs will shift focus to ultra-specialized reasoning models (like o4, o5) or multimodal dominance where they can maintain an edge.
- Price collapse: we'll see sub-$0.10/M pricing for 50+ quality models as competition intensifies and infrastructure costs drop.
- Provider consolidation: infrastructure providers (Nebius, Fireworks, Together AI, Groq) will become more valuable than model creators, similar to how cloud providers became more important than Linux distros.
- Enterprise adoption: open source will cross 50% market share for production workloads as enterprises prioritize cost control and customization.
The bottom line
Open source LLMs have achieved "good enough" quality for approximately 80% of real-world use cases while costing 86% less than proprietary alternatives. Proprietary models retain a narrow lead for elite tasks (that top 20%), but the window is closing rapidly.
The smart strategy for most organizations is a hybrid approach:
- Use open source for high-volume, cost-sensitive workloads (customer service, content generation, basic coding, Q&A)
- Reserve proprietary for critical edge cases where absolute best quality matters (complex reasoning, mission-critical code, specialized analysis)
- Continuously re-evaluate as open source improves (likely every 3-6 months given the current pace)
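The hybrid strategy above can be sketched as a simple router. This is a minimal illustration, not a prescription: the model identifiers and the task taxonomy are assumptions I'm making for the example, and in practice you'd tune both to your own workload:

```python
# Hypothetical hybrid router: high-volume, low-stakes work goes to an open
# source model; critical tasks are reserved for a proprietary model.
OPEN_SOURCE_MODEL = "qwen3-235b"     # cheap, quality ~57 (illustrative id)
PROPRIETARY_MODEL = "gpt-5.1-high"   # expensive, quality ~70 (illustrative id)

# Task labels are an assumed taxonomy, not part of the benchmark data.
CRITICAL_TASKS = {"complex_reasoning", "production_code", "specialized_analysis"}

def route(task_type: str) -> str:
    """Pick a model id for a request based on task criticality."""
    return PROPRIETARY_MODEL if task_type in CRITICAL_TASKS else OPEN_SOURCE_MODEL

print(route("customer_service"))  # qwen3-235b
print(route("production_code"))   # gpt-5.1-high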
We're witnessing a fundamental shift in the AI landscape. The question is no longer "Can open source compete?" but rather "Where does proprietary still justify its premium?" That's a remarkable transformation in just 18 months.
Want to explore this data yourself? Check out our interactive LLM comparison tool where you can filter by price, quality, speed, and more across all 94 models.
📚 Cite this analysis
If you're referencing this data in your work, please use the following citation:
Bristot, D. (2025, October 28). Open source vs proprietary LLMs: Complete 2025 benchmark analysis. What LLM. https://whatllm.org/blog/open-source-vs-proprietary-llms-2025
Data sources: Artificial Analysis (LLM Leaderboard & API Providers Leaderboard, October 2025)