The open source revolution: how December 2025 changed everything

Three weeks. Four frontier releases. One inescapable conclusion: the moat is gone.

By Dylan Bristot · 18 min read

December by the numbers

4 frontier open releases
96% on AIME 2025 (DeepSeek)
$0.10 per million tokens
$20B NVIDIA-Groq deal

Open-weight models matched proprietary performance at 10-30x lower cost. The economics of AI shifted permanently.

Something fundamental broke in December 2025. Not a system failure or a market crash, but something more consequential: the assumption that frontier AI required frontier capital. That the best models would always sit behind API paywalls. That open source would remain perpetually two steps behind.

The evidence arrived in waves. On December 1, DeepSeek released V3.2 with a 96% score on AIME 2025, surpassing GPT-5's 94.6%. On December 17, Xiaomi, a company known for smartphones and rice cookers, dropped MiMo-V2-Flash and landed a 67% on SWE-Bench Verified, topping every open model before it. On December 22, Z.ai's GLM-4.7 claimed the coding crown with a 95.7% AIME score while offering 200,000-token context at prices that made Claude look like a luxury tax. And on Christmas Eve, NVIDIA announced a $20 billion acquisition of Groq, signaling that the infrastructure race had entered its next phase.

This is the story of the month that rewrote the rules. Not through hype or speculation, but through benchmarks, architecture decisions, and pricing that forced everyone to reconsider what open source can achieve.

The DeepSeek V3.2 moment

When DeepSeek V3.2 landed on December 1, the reaction split into two camps. Researchers immediately noticed the numbers: 96% on AIME 2025, gold-medal performance on the International Mathematical Olympiad, 71% on SWE-Bench Verified for coding tasks. These weren't incremental improvements. They represented a model matching or exceeding the best proprietary systems on the benchmarks that matter most for reasoning and code.

The architecture tells part of the story. DeepSeek V3.2 runs 671 billion total parameters with 37 billion active through sparse Mixture-of-Experts routing. This design choice, where only a fraction of the network activates for any given token, explains how a model this capable can run inference at 150 tokens per second on optimized hardware while costing just $0.028 per million input tokens. For context, Claude Opus 4.5 charges $15 per million input tokens. GPT-5 sits around $3.50. DeepSeek is operating at a different order of magnitude.

But capability alone doesn't explain what happened in December. The real shift was the combination of three factors: benchmark-leading performance, MIT licensing that allows commercial use and modification, and pricing that makes running your own instance economically rational for any team processing more than a few hundred thousand tokens daily.

| Model | AIME 2025 | SWE-Bench | Price/M tokens | License |
| --- | --- | --- | --- | --- |
| DeepSeek V3.2 | 96% | 68% | $0.10 | MIT |
| GPT-5.1 High | 94% | 70% | $3.44 | Proprietary |
| Gemini 3.0 Pro | 95% | 65% | $2.50 | Proprietary |
| Claude Opus 4.5 | 91% | 72% | $15.00 | Proprietary |

The V3.2-Speciale variant pushed this further. By stripping tool-calling capabilities entirely, DeepSeek created a pure reasoning engine that hit 97% on AIME 2025, the highest score any model has achieved. The tradeoff was deliberate: Speciale scores 0% on tool-use benchmarks like τ²-Bench Telecom because it was never trained for those tasks. This isn't a limitation. It's a design philosophy that recognizes different workloads need different optimizations.

Early testers reported mixed experiences. The benchmarks showed one reality, but practical coding work sometimes felt less polished than the numbers suggested. This gap between benchmark performance and production feel has become a recurring theme in 2025, one that suggests our evaluation methods still don't capture everything that matters in real-world use. But even the skeptics acknowledged the price-performance ratio was impossible to ignore.

Xiaomi enters the frontier

If DeepSeek's release was expected, Xiaomi's was surreal. A consumer electronics company, best known for budget smartphones and smart home devices, releasing a frontier language model felt like a category error. And yet MiMo-V2-Flash exists, runs at 309 billion parameters with 15 billion active through MoE routing, and performs at levels that would have seemed impossible for a first major release.

The numbers demand attention. MiMo-V2-Flash achieved 71.7% on SWE-Bench Multilingual, the highest score for any open-weight model on software engineering tasks across multiple programming languages. It matched DeepSeek V3.2's 96% on AIME 2025. On agentic benchmarks like BrowseComp, which tests autonomous web navigation and task completion, it set new records for open models.

The pricing followed the pattern established by Chinese labs: $0.10 per million input tokens, $0.30 per million output. Running a comprehensive evaluation suite costs around $53. For comparison, running similar evaluations on Claude Opus 4.5 would cost over $500.

But MiMo's significance extends beyond benchmarks. Xiaomi built this model with embodied AI integration in mind, connecting it to their autonomous driving research and robotics initiatives. The model's efficiency, running just 15 billion active parameters while achieving frontier scores, enables deployment on edge devices and real-time systems where latency matters more than raw capability. This represents a different vision of AI development, one where the model is designed from the start for physical-world applications rather than being an API product adapted for other uses.

The weird 2025 moment

When the news broke, the reaction on X captured the strangeness of the situation: "A phone company just dropped a frontier model." The fact that this felt absurd, that we still associate cutting-edge AI with a handful of dedicated labs, reveals how quickly our assumptions have become outdated. Hardware margins are thin. Software margins are high. Every major technology company is now asking whether they should be building their own models.

GLM-4.7 claims the coding crown

Z.ai's GLM-4.7 arrived on December 22 with a specific claim: this is the best open-source model for developers. The benchmarks support the assertion. A 95.7% on AIME 2025 put it ahead of both Gemini 3.0 Pro and GPT-5.1 High. A 67 on the Artificial Analysis Quality Index matched Claude Sonnet 4.5 for overall capability. But the real differentiator was performance on coding-specific tasks, particularly multilingual agentic coding where the model executes sequences of actions in development environments.

GLM-4.7's 200,000-token context window matters for development workflows where you need to hold entire codebases in memory while reasoning about changes. The architecture improvements from GLM-4.5 show 3-7% gains on reasoning benchmarks like GPQA Diamond and MMLU-Pro, incremental but meaningful when compounded across the evaluation suite.
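As a rough sanity check before committing to a model, you can estimate whether a given codebase actually fits in that window. The sketch below uses the common rule of thumb of roughly four characters per token; the heuristic and the file-extension filter are assumptions, and a real tokenizer will give different counts.

```python
import os

# Rough heuristic: ~4 characters per token for code and English prose.
# Real tokenizers vary with language and content; this is only an estimate.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW = 200_000  # GLM-4.7's advertised context length

SOURCE_EXTENSIONS = {".py", ".ts", ".js", ".go", ".rs", ".java", ".md"}

def estimate_repo_tokens(root: str) -> int:
    """Estimate how many tokens a repository would occupy in context."""
    total_chars = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1] in SOURCE_EXTENSIONS:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    continue
    return total_chars // CHARS_PER_TOKEN

if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"Estimated {tokens:,} tokens; fits in 200K window: {tokens <= CONTEXT_WINDOW}")
```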

The pricing structure reflects Z.ai's bet on developer adoption: $0.60 per million input tokens, $2.20 per million output. This is roughly 7x cheaper than Claude for comparable quality scores, with higher rate limits that enable burst workloads without throttling.

What makes GLM-4.7 distinct is its strength in terminal and command-line tasks. The model was trained with emphasis on understanding shell environments, file system operations, and the kind of multi-step debugging that software development actually requires. On Terminal-Bench, one of the few benchmarks that tests autonomous computer operation, GLM-4.7 outperformed models scoring higher on traditional metrics.

This points to a broader shift in how labs are approaching model development. Rather than chasing general intelligence scores, there's increasing focus on specific capability profiles that match real workloads. A model that scores 2 points lower on MMLU but handles agentic coding better is more valuable for development teams than the reverse.

NVIDIA's $20 billion bet on inference

On December 24, while most of the industry was heading into holiday mode, NVIDIA announced the largest acquisition in its history: $20 billion for Groq's assets, technology, and key personnel. The deal structure, involving licensing agreements and talent acquisition rather than a traditional corporate purchase, allowed NVIDIA to integrate Groq's inference-optimized silicon without triggering the antitrust scrutiny that a full acquisition would invite.

The timing wasn't coincidental. As open-weight models proliferated throughout 2025, the bottleneck shifted from training compute to inference efficiency. Training a model once is expensive but finite. Running inference on that model for millions of users is an ongoing cost that scales with demand. Groq's tensor streaming processor architecture, designed specifically for low-latency inference, represents exactly the capability NVIDIA needs as the industry moves from "who can train the biggest model" to "who can serve models at the lowest cost per token."

Jensen Huang's statement about "AI factories" framed the acquisition as part of a larger infrastructure buildout. NVIDIA isn't just selling GPUs anymore. They're positioning to own the full stack from training clusters to inference infrastructure, with proprietary silicon optimized for each phase. The Groq deal gives them technology that rivals custom ASICs from companies like Cerebras while maintaining the software ecosystem advantages that have made CUDA dominant.

The broader context matters. December 2025 saw $157 billion in announced AI and data infrastructure deals. Google acquired Intersect Power for $4.75 billion to secure data center energy. Meta increased its stake in Scale AI. The race isn't just for model capability anymore. It's for the physical infrastructure to run these models at scale.

What the efficiency gains actually mean

The cost numbers from December deserve closer examination. DeepSeek trained V3.2 for approximately $5.57 million. By contrast, estimates for GPT-4's training costs ranged from $50 million to over $100 million, and GPT-5's training budget reportedly went significantly higher still. The gap isn't explained by simply buying less compute and accepting a weaker model. It's explained by architectural efficiency and training-methodology improvements that extract more capability from every dollar spent.

Sparse Mixture-of-Experts is central to this efficiency. By routing each token through a subset of specialized networks rather than the full parameter count, MoE architectures achieve the representational capacity of massive models with the computational cost of much smaller ones. DeepSeek V3.2 activates 37 billion parameters for inference despite having 671 billion total. This 18x ratio between total and active parameters is aggressive compared to earlier MoE designs, and it's enabled by improved routing algorithms that better match tokens to relevant expert networks.
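To make the routing idea concrete, here is a toy sparse-MoE forward pass in Python. The dimensions, gating function, and expert count are illustrative only, not DeepSeek's actual architecture; the point is simply that each token touches a small, learned subset of the experts.

```python
import numpy as np

# Toy sparse Mixture-of-Experts layer: a gating network picks the top-k
# experts for each token, so only a fraction of the total parameters
# does any work per token. Dimensions are illustrative, not DeepSeek's.
rng = np.random.default_rng(0)

d_model, d_ff = 64, 256
n_experts, top_k = 16, 2

gate_w = rng.normal(size=(d_model, n_experts))  # router weights
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model). Route each token through its top-k experts."""
    logits = x @ gate_w                              # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of chosen experts
    chosen = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(chosen - chosen.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over chosen experts

    out = np.zeros_like(x)
    for i, token in enumerate(x):
        for j, expert_idx in enumerate(top[i]):
            w_in, w_out = experts[expert_idx]
            out[i] += weights[i, j] * (np.maximum(token @ w_in, 0) @ w_out)
    return out

tokens = rng.normal(size=(8, d_model))
print(moe_forward(tokens).shape, f"active experts per token: {top_k}/{n_experts}")
```

With 2 of 16 experts active, only about an eighth of the expert parameters run for each token. DeepSeek's 37 billion active out of 671 billion total is the same principle at far larger scale, with more sophisticated routing and load balancing.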

The inference cost story is equally dramatic. Running DeepSeek V3.2 at $0.028 per million input tokens means that a full evaluation suite, comprehensive testing across reasoning, coding, and language tasks, costs around $54. The same evaluation on GPT-5 would cost over $300. On Claude Opus 4.5, closer to $800. These aren't minor differences. They represent order-of-magnitude changes in what experimentation and deployment cost.

Cost comparison: running a full evaluation suite

DeepSeek V3.2: $54
Xiaomi MiMo-V2-Flash: $53
GLM-4.7: $180
GPT-5.1 High: $340
Claude Opus 4.5: $820

Based on standardized evaluation across reasoning, coding, and language benchmarks. Actual costs vary with token counts.

For organizations processing significant token volumes, the math has shifted decisively. At 100 million tokens monthly, the difference between DeepSeek and Claude represents over $14,000 in annual savings with comparable quality on most benchmarks. At a billion tokens monthly, that becomes $140,000. These numbers are large enough to fund additional engineering headcount, which for many teams represents a better investment than marginal improvements in model capability.
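The arithmetic is simple enough to script. The sketch below uses the blended per-million-token prices from the comparison table earlier in this piece; actual bills depend on the split between input and output tokens and on provider markups, so treat the output as a rough bound rather than a quote.

```python
# Back-of-the-envelope annual cost comparison using the blended $/million
# figures from the table above. Real bills depend on the input/output
# token mix and on the provider serving the weights.
PRICES_PER_MILLION = {
    "DeepSeek V3.2": 0.10,
    "GPT-5.1 High": 3.44,
    "Gemini 3.0 Pro": 2.50,
    "Claude Opus 4.5": 15.00,
}

def annual_cost(model: str, tokens_per_month: float) -> float:
    return PRICES_PER_MILLION[model] * (tokens_per_month / 1e6) * 12

monthly_volume = 100_000_000  # 100M tokens per month
for model in PRICES_PER_MILLION:
    print(f"{model:16s} ${annual_cost(model, monthly_volume):>10,.0f}/year")
```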

The reasoning paradigm shift

Andrej Karpathy's year-in-review, posted on December 19, articulated a conceptual shift that December's releases exemplified. The important developments in 2025, he argued, weren't about scaling up parameter counts or training data. They were about changing how models reason at inference time.

Reinforcement Learning with Verifiable Rewards, or RLVR, represents one piece of this shift. Unlike traditional RLHF where human preferences shape model outputs, RLVR uses programmatically verifiable outcomes to provide reward signals. A model generating code can be evaluated by whether the code runs correctly. A model solving math problems can be checked against ground-truth answers. This removes human bottlenecks from the training loop and enables much larger-scale preference learning.
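What a verifiable reward looks like in practice is easiest to show for code generation: run the candidate against a test program and return a binary reward. The sketch below is deliberately naive about sandboxing; production RLVR pipelines isolate execution far more carefully, and the function and variable names here are illustrative.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

def verifiable_reward(candidate_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Return 1.0 if the generated code passes the tests, else 0.0.

    The reward comes from a program that checks the output, not from a
    human rater; real pipelines run this inside a proper sandbox.
    """
    program = textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)

# Example: a generated solution plus assert-based tests.
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verifiable_reward(solution, tests))  # 1.0
```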

The December models showcase this approach in practice. DeepSeek V3.2's hybrid "thinking/non-thinking" modes allow it to allocate more compute to difficult problems while handling simple queries efficiently. GLM-4.7's emphasis on terminal operations suggests training that incorporated execution feedback, not just text prediction. MiMo-V2-Flash's strength in agentic tasks points to similar methodology where action outcomes shaped model behavior.

This has implications for where AI capability goes next. Traditional scaling laws suggested that doubling parameters and data would yield predictable capability gains. Those returns are diminishing. But test-time compute, spending more inference cycles on harder problems, offers a new scaling dimension that isn't subject to the same limits. A model that can "think longer" on difficult problems while remaining efficient on easy ones can match larger models on capability while maintaining cost efficiency.
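A simple way to picture test-time scaling is an adaptive sampling loop: answer easy queries with a single pass, and spend extra samples plus a majority vote only when a cheap confidence check comes back low. The `generate` and `confidence` callables below are placeholders for whatever model client and scoring heuristic you use.

```python
from collections import Counter
from typing import Callable

def solve_with_adaptive_compute(
    prompt: str,
    generate: Callable[[str], str],           # placeholder: one model call -> answer
    confidence: Callable[[str, str], float],  # placeholder: cheap scorer in [0, 1]
    threshold: float = 0.8,
    max_samples: int = 8,
) -> str:
    """Spend extra inference only on problems the model seems unsure about."""
    first = generate(prompt)
    if confidence(prompt, first) >= threshold:
        return first  # easy case: one sample is enough

    # Hard case: draw more candidates and take a majority vote, a simple
    # form of test-time scaling (self-consistency).
    answers = [first] + [generate(prompt) for _ in range(max_samples - 1)]
    return Counter(answers).most_common(1)[0][0]
```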

The benchmark saturation problem

By the end of December, seventeen models scored above 90% on AIME 2025. Twenty exceeded 85% on MMLU-Pro. The benchmarks that defined progress for years have stopped differentiating frontier systems. When everyone aces the test, the test stops telling you anything useful.

New evaluation frameworks are emerging to address this. Humanity's Last Exam, designed explicitly to resist saturation, tests PhD-level problems across disciplines where current models perform near chance levels. Only one model has cleared 30%. Terminal-Bench Hard measures autonomous computer operation, testing whether models can navigate file systems, debug code, and complete multi-step tasks without human intervention. Only two models exceed 40%.

These harder benchmarks reveal capability gaps that traditional metrics miss. Models that appear equivalent on AIME diverge significantly on agentic tasks. The "jagged intelligence" pattern Karpathy described, where models exhibit superhuman performance on some tasks while failing basic reasoning on others, persists even at the frontier.

For practical purposes, this means benchmark scores have become less useful for model selection. A 96% versus 94% on AIME doesn't predict which model will work better for your codebase. The community has started running task-specific evaluations, testing models on representative samples of their actual workloads rather than relying on standardized metrics. This is more labor-intensive but yields more actionable results.
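A task-specific evaluation doesn't need heavy tooling. Here is a sketch of the idea, with `call_model` standing in for whatever API client you use and the example cases being hypothetical stand-ins for prompts drawn from a real workload.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # task-specific correctness check

def run_eval(call_model: Callable[[str, str], str],
             model: str, cases: list[EvalCase]) -> float:
    """Return the pass rate of `model` on your own representative cases.

    `call_model(model, prompt)` is a placeholder for your API client.
    """
    passed = sum(1 for c in cases if c.check(call_model(model, c.prompt)))
    return passed / len(cases)

# Hypothetical cases drawn from a real workload, not a public benchmark.
cases = [
    EvalCase("Write a SQL query counting orders per customer.",
             lambda out: "GROUP BY" in out.upper()),
    EvalCase("Summarize this stack trace in one sentence: ...",
             lambda out: len(out.split()) < 40),
]
```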

China's open-source strategy

December capped a year in which China filed over 700 generative AI models with regulatory authorities. November alone saw fifteen major open-weight releases from Chinese labs. The strategy behind this surge goes beyond technical capability. It represents a deliberate approach to AI ecosystem development that differs fundamentally from the Western model.

American frontier labs, led by OpenAI and Anthropic, have primarily monetized through API access. The model weights remain proprietary. Revenue comes from inference charges. This creates ongoing income streams but also limits how customers can use the technology. You can't fine-tune a model you don't have the weights for. You can't deploy it on your own infrastructure. You can't modify it for specialized use cases.

Chinese labs chose differently. DeepSeek, Alibaba's Qwen, Z.ai's GLM, and now Xiaomi's MiMo all release under MIT or Apache licenses that permit commercial use, modification, and redistribution. The monetization strategy appears to focus on infrastructure services, consulting, and integration work rather than per-token charges for the base models.

This creates different adoption dynamics. When a model is truly open, it gets deployed in contexts the original developers never anticipated. It gets fine-tuned on private datasets. It gets run on local hardware. Each of these uses extends the ecosystem and creates dependencies that benefit the model's creators even without direct revenue from model access.

By the end of 2025, the volume play appears to be working. Open-weight models from Chinese labs matched proprietary American models on most benchmarks while costing 10-30x less to run. The question for 2026 isn't whether open source can compete. It's whether proprietary models can justify their premium for anything beyond the most demanding edge cases.

What NVIDIA's move signals

The Groq acquisition makes more sense in the context of open-weight proliferation. When anyone can download frontier model weights, competitive advantage shifts from having the model to serving it efficiently. Training compute matters less. Inference efficiency matters more.

Groq's tensor streaming architecture processes tokens differently than traditional GPU designs. Rather than batching operations and managing memory hierarchies, the TSP dedicates silicon to deterministic, low-latency execution. This makes it exceptional for inference workloads where consistent response times matter more than peak throughput. The tradeoff is flexibility. Groq hardware is optimized for a narrower range of operations than general-purpose GPUs.

For NVIDIA, acquiring this technology hedges against multiple futures. If proprietary models maintain their lead, NVIDIA's training hardware remains essential. If open-weight models dominate, NVIDIA now owns a leading inference platform. Either way, they're positioned in the silicon layer where margin remains healthy.

The acquisition also signals concern about alternatives. Custom AI accelerators from Cerebras, Graphcore, and Amazon's Trainium represent genuine competition for specific workloads. Groq had proven that purpose-built inference silicon could dramatically outperform GPUs on latency-sensitive tasks. Bringing that capability in-house removes a competitor while adding capabilities that complement NVIDIA's existing portfolio.

The energy dimension

Underneath the model releases and acquisitions, another story played out in December. Google's $4.75 billion acquisition of Intersect Power, focused on renewable energy for data centers, highlighted how AI's energy demands are reshaping infrastructure investment. The Stargate project, a multi-hundred-billion-dollar initiative for AI data center construction, continued to attract capital even as market volatility hit other technology investments.

The efficiency gains in December's models matter in this context. DeepSeek V3.2 achieves its benchmark scores while activating a fraction of its total parameters. This means proportionally lower energy per inference. When multiplied across billions of requests, the difference between 37 billion active parameters and 671 billion represents substantial energy savings.

MoE architectures and test-time compute scaling both point toward a future where AI efficiency improves faster than demand grows. The AI Index Report for 2025 found that inference costs dropped 280x over the past year while usage grew 31x. Taken together, that is roughly a ninefold drop in total spend for equivalent capability (31/280 ≈ 0.11). This trajectory, if maintained, suggests AI's energy footprint may grow more slowly than pessimistic projections indicated.

But the base demand continues rising. Each new capability unlocks new applications. Agents that can browse the web, write code, and execute multi-step tasks consume far more tokens per session than simple question-answering. As models become more capable, they get used for more complex tasks that require more compute. The efficiency gains buy time, but they don't eliminate the infrastructure buildout requirements.

Practical guidance for January

For developers and organizations evaluating models at the start of 2026, December's releases scrambled the previous calculus. Here's how to think about selection based on specific workloads:

Math-heavy reasoning tasks

Best choice: DeepSeek V3.2-Speciale

97% AIME 2025, but no tool-calling capability. Pure reasoning at the lowest cost. Use the standard V3.2 if you need agentic features.

Software development workflows

Best choice: GLM-4.7

200K context holds full codebases. Strong terminal and debugging performance. 7x cheaper than Claude at comparable quality.

Agentic and autonomous tasks

Best choice: MiMo-V2-Flash or DeepSeek V3.2

Both lead on SWE-Bench and agentic benchmarks. MiMo edges ahead on multilingual tasks. DeepSeek wins on pure math.

Maximum quality regardless of cost

Best choice: GPT-5.1 High or Claude Opus 4.5

Still lead on some benchmarks and polish metrics. Worth the premium only for production systems where marginal quality improvement matters.

The general recommendation has shifted. For most workloads, start with open-weight models and only escalate to proprietary if specific capability gaps emerge. The cost savings fund more experimentation, more fine-tuning, and more infrastructure investment. This reverses the previous default where teams started with GPT or Claude and looked for cheaper alternatives only when budget pressures demanded it.

What comes next

December's releases don't represent a new equilibrium. They represent a phase transition that continues to unfold. Several dynamics will shape the first quarter of 2026:

The provider layer is fragmenting. The same DeepSeek V3.2 weights now run through SambaNova, Novita, Fireworks, DeepInfra, and Eigen AI, each with different quantization options, latency profiles, and pricing. Picking a model is no longer sufficient. You're now picking model, provider, quantization, and deployment region. This adds complexity but also creates arbitrage opportunities as providers compete on margins.
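One way to keep that complexity manageable is to make the deployment choice an explicit, typed configuration rather than a hard-coded endpoint. The sketch below is illustrative only; the URLs, quantization labels, and routing policy are placeholders, not any provider's actual values.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    model: str          # which open weights
    provider: str       # who serves them
    base_url: str       # placeholder endpoint, not a real URL
    quantization: str   # e.g. "fp8", "int4"; availability varies by provider
    region: str

# The same weights, two very different deployments. Values are placeholders;
# check each provider's documentation for real options.
DEPLOYMENTS = [
    Deployment("deepseek-v3.2", "fireworks", "https://example.com/v1", "fp8", "us-west"),
    Deployment("deepseek-v3.2", "deepinfra", "https://example.com/v1", "int4", "eu-central"),
]

def pick(quality_critical: bool) -> Deployment:
    # Toy policy: serve quality-critical traffic from the higher-precision
    # deployment, everything else from the cheaper quantized one.
    return DEPLOYMENTS[0] if quality_critical else DEPLOYMENTS[1]
```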

Benchmark development is accelerating to match capability growth. Expect new evaluation frameworks that test multi-session memory, persistent task execution, and real-world agentic performance. The benchmarks that exist today will be deprecated within a year as ceiling effects make them uninformative.

Fine-tuning infrastructure is maturing. The value of open weights depends on being able to adapt them to specific domains. Services like Hugging Face, Modal, and specialized fine-tuning platforms are making this accessible to teams without dedicated ML infrastructure. This extends open-source advantages beyond cost into capability customization that proprietary models can't match.
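The typical entry point is parameter-efficient fine-tuning rather than full retraining. A minimal LoRA setup with Hugging Face transformers and peft looks roughly like this; the model ID and hyperparameters are placeholders, and adapting a frontier-scale MoE model needs multi-GPU infrastructure, but the workflow is the same.

```python
# Minimal LoRA fine-tuning setup with Hugging Face transformers + peft.
# The model ID and hyperparameters are placeholders, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_id = "your-org/small-open-model"  # placeholder, not a real repo

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```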

Energy and compute constraints will increasingly shape development. The efficiency gains in December's models weren't accidental. They reflect deliberate architectural choices driven by awareness that raw scaling has hit practical limits. Expect continued focus on test-time efficiency, sparse activation, and inference optimization rather than simple parameter growth.

The bottom line

December 2025 wasn't just another month of model releases. It was the month open-source AI reached undeniable parity with proprietary systems on the metrics that matter most. DeepSeek, GLM, and Xiaomi proved that frontier capability doesn't require frontier capital. The moat that protected closed models, the assumption that only billion-dollar training runs could produce state-of-the-art results, broke permanently. What happens in 2026 builds on this new foundation, where open weights are the default and proprietary models need to justify their premium in specific, measurable ways.

Data sourced from WhatLLM.org tracking, Artificial Analysis benchmarks, and official model documentation. See our interactive comparison tool for the latest numbers across all tracked models.

Cite this analysis

If you're referencing this content in your work:

Bristot, D. (2025, December 26). The open source revolution: How December 2025 changed everything. What LLM. https://whatllm.org/blog/open-source-revolution-december-2025

Sources: Artificial Analysis, DeepSeek technical reports, Z.ai documentation, Xiaomi MiMo release notes, December 2025