The State of LLMs: December 2025

The benchmarks that defined progress are now meaningless. The models everyone relies on cost 30x what the alternatives do. And nobody agrees on what to measure anymore.

By Dylan Bristot · 16 min read

The numbers that matter

  • 114 models tracked
  • 67 open-weight
  • 17 score 90%+ on AIME
  • 2 score 40%+ on Terminal-Bench

Math is solved. Agentic coding is not. The gap between what models can memorize and what they can do has never been wider.

The leaderboard froze. The action moved elsewhere.

Three models sit at the top: Gemini 3 Pro Preview at 73 on the Artificial Analysis Intelligence Index, GPT-5.1 and Claude Opus 4.5 tied at 70. This ordering has been stable for months. Google, OpenAI, and Anthropic take turns announcing improvements, benchmark scores tick up a point or two, and nothing fundamentally changes at the summit.

The real movement is happening below. In the 60-67 range, open-weight models from Chinese labs are stacking up fast. DeepSeek V3.2 landed at 66 this week. Kimi K2 Thinking holds 67. These aren't research previews or experimental checkpoints. They're production-ready models with MIT licenses, priced at a fraction of what the leaders charge.

Here's the comparison that should concern every AI product manager:

Model               Intelligence Index   Price/M tokens   License
GPT-5 (medium)      66                   $3.44            Proprietary
DeepSeek V3.2       66                   $0.32            Open (MIT)
o3                  65                   $3.50            Proprietary
Kimi K2 Thinking    67                   $1.07            Open

Same capability. 10x price difference. The MIT research estimating $24.8B in wasted spending on closed models this year doesn't seem far off.
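
To make the per-token gap concrete, here is a back-of-the-envelope calculation in Python. The 2B-tokens-per-month workload is a made-up volume chosen purely for illustration; the prices are the blended figures from the table above.

```python
# Back-of-the-envelope cost comparison using the blended prices above.
# The 2B tokens/month volume is a hypothetical workload, not a real figure.

PRICE_PER_M_TOKENS = {
    "GPT-5 (medium)": 3.44,
    "DeepSeek V3.2": 0.32,
}

MONTHLY_TOKENS = 2_000_000_000  # assumed workload: 2B blended tokens/month

for model, price in PRICE_PER_M_TOKENS.items():
    monthly_cost = MONTHLY_TOKENS / 1_000_000 * price
    print(f"{model:<16} ${monthly_cost:>10,.2f}/month")

# GPT-5 (medium)   $  6,880.00/month
# DeepSeek V3.2    $    640.00/month  -> roughly 10.8x cheaper at identical volume
```

The ratio holds at any volume; only the absolute dollars change.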

The benchmark crisis

Look at any AI benchmark from two years ago and you'll see scores that seemed impossible at the time. MMLU-Pro? Twenty models clear 85%. AIME 2025? Seventeen models above 90%. Competition math problems that stumped graduate students are now routine.

The problem is that none of this predicts which model you should actually use.

We're watching two categories emerge in real time:

Solved benchmarks

  • AIME 2025: 17 models at 90%+
  • MMLU-Pro: 20 models at 85%+
  • GPQA Diamond: scores clustering in the 80s

These measure what models know. Knowledge is now cheap.

Unsolved benchmarks

  • Humanity's Last Exam: Only 1 model above 30%
  • Terminal-Bench Hard: Only 2 models above 40%
  • SciCode: Best score is 56%

These measure what models can do. Agency remains expensive.

Humanity's Last Exam was designed to resist saturation: PhD-level problems across every discipline, with many questions where the best models perform barely above chance. Only one model clears 30%. It's one of the few benchmarks with real room left to run.

Terminal-Bench Hard measures something closer to what people actually want: can the model operate a computer, debug code, and complete multi-step tasks autonomously? Only two models break 40%. This is where the next round of competition will play out.

The Speciale paradox

DeepSeek released something strange this week: V3.2-Speciale. It scores 97% on AIME 2025, the highest of any model. It hits 90% on LiveCodeBench. Pure reasoning ability is off the charts.

It also scores 0% on τ²-Bench Telecom, the test of tool-calling and agentic behavior. Zero. The model was deliberately stripped of that capability to maximize raw reasoning. It can solve any math problem you throw at it but can't make a function call.

This tradeoff is becoming more common. The era of "one model that does everything well" is fading. Labs are building specialists, then letting developers pick the right tool for the job. DeepSeek V3.2 for general work. Speciale for reasoning-heavy tasks. Different endpoints of the same architecture, optimized for different outcomes.

The pattern is repeating across the industry. OpenAI ships GPT-5 in high, medium, and low reasoning-effort variants. Amazon's Nova 2.0 lineup splits into Pro, Lite, and Omni for different modality needs. The generalist flagship still exists at the top of each lineup, but it costs 10-30x more than the specialists.

The open-weight surge

Count the models: 67 open-weight versus 41 proprietary in our tracking. Open represents 59% of available options now. Two years ago, proprietary dominated. What changed?

Chinese labs went into overdrive. DeepSeek alone has seven tracked variants, each optimizing for different workloads. Moonshot AI's Kimi K2 Thinking hit 67 on the Intelligence Index while remaining fully open. Alibaba's Qwen family spans parameter counts from 3B to 235B with competitive benchmarks at each tier.

The X posts capture the mood: "Open-source won." "Kimi K2 is the new benchmark." "China won the open-weight race." The community celebration is justified. On some agentic benchmarks, Kimi K2 Thinking outperforms everything except the highest-tier GPT-5 variants.

But the skeptics have a point too. "Benchmaxxed" is the term floating around. Models optimized ruthlessly for test performance don't always translate to real-world usefulness. DeepSeek V3.2's benchmark numbers are "wild," but early testers report it feels "mid" for actual coding work. The numbers say one thing. The vibes say another.

Both camps are right. For structured tasks with clear evaluation criteria, the open-weight savings are real. For fuzzy, conversational work where "feel" matters, the differences between models become harder to quantify, and cheaper doesn't always mean better.

What Amazon and xAI are actually doing

Two companies deserve more attention than they're getting.

Amazon's Nova 2.0 lineup quietly landed some of the best agentic scores in our tracking. Nova 2.0 Pro Preview hits 93% on τ²-Bench Telecom while most frontier models hover in the 70-85% range. Amazon optimized for "AI doing real work" rather than chasing leaderboard positions on academic tests. If your use case involves tool calling, function execution, or multi-step workflows, Nova deserves serious consideration.

xAI's new Grok 4.1 Fast pushes context to 2 million tokens while scoring 93% on τ²-Bench. The focus on long-context combined with agentic capability matters for document processing and analysis workloads that were previously impractical. At $0.28 per million tokens, it's aggressively priced for what it offers.
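
For readers who haven't worked with agentic APIs, this is roughly what τ²-Bench-style tool calling exercises: the model must decide to emit a structured function call rather than a plain-text guess. The sketch below uses the openai Python SDK against a generic OpenAI-compatible endpoint; the base URL, model id, and tool definition are placeholders, not documented values for Nova or Grok.

```python
# Minimal sketch of a single tool-calling round trip against an
# OpenAI-compatible chat endpoint. The base_url, model id, and tool below
# are placeholders for illustration, not documented provider values.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice_status",  # hypothetical tool for illustration
        "description": "Look up the payment status of an invoice.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="example-agentic-model",  # placeholder model id
    messages=[{"role": "user", "content": "Has invoice INV-1042 been paid?"}],
    tools=tools,
)

# Agentic benchmarks score whether the model emits a correct, well-formed
# call here instead of answering with a plain-text guess.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```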

Both companies are staking out positions in practical enterprise AI rather than competing directly on intelligence scores. Given where the benchmarks are heading, that might be the smarter play.

The provider explosion

The same model is now available from six different providers with different price points, speeds, and reliability characteristics. DeepSeek V3.1 Terminus runs through SambaNova, Novita, Fireworks, DeepInfra, and Eigen AI, each with unique quantization options and latency profiles. Kimi K2 Thinking is accessible via Moonshot, Fireworks, Parasail, Novita, and now Nebius Token Factory.

This fragmentation creates real headaches. Same model, different FP8 versus FP4 quantization, different SLAs, different burst capacity. The "best" choice depends entirely on your workload shape. High-throughput batch processing wants different infrastructure than low-latency chat.

What used to be a model decision is now a systems engineering problem. You're not just picking GPT versus Claude anymore. You're picking provider, quantization, region, and pricing tier. The model itself is becoming a commodity. The infrastructure around it is where margins will be made.
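
One way to treat that as the systems problem it is: encode each provider's offering for a given open-weight model and filter by the workload's constraints. The sketch below is illustrative only; the provider names, latencies, and prices are invented to show the shape of the decision, not measured numbers for any real endpoint.

```python
# Illustrative provider-selection helper. The entries capture the attributes
# that actually differ across hosts of the same model (quantization, latency,
# price); the values themselves are made up.
from dataclasses import dataclass

@dataclass
class ProviderOffer:
    name: str
    quantization: str         # e.g. "FP8" or "FP4"
    p50_latency_ms: int       # typical time-to-first-token
    price_per_m_tokens: float

OFFERS = [
    ProviderOffer("provider-a", "FP8", 450, 0.40),
    ProviderOffer("provider-b", "FP4", 250, 0.30),
    ProviderOffer("provider-c", "FP8", 900, 0.25),
]

def pick(offers, max_latency_ms=None, require_fp8=False):
    """Return the cheapest offer that satisfies the workload's constraints."""
    ok = offers
    if max_latency_ms is not None:
        ok = [o for o in ok if o.p50_latency_ms <= max_latency_ms]
    if require_fp8:
        ok = [o for o in ok if o.quantization == "FP8"]
    return min(ok, key=lambda o: o.price_per_m_tokens, default=None)

# Low-latency chat wants different infrastructure than batch processing:
print(pick(OFFERS, max_latency_ms=500))  # cheapest option fast enough for chat
print(pick(OFFERS, require_fp8=True))    # cheapest higher-precision option
```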

The price reality

Let's make the cost comparison concrete:

Price per million tokens (blended)

Grok 4.1 Fast       $0.28
DeepSeek V3.2       $0.32
Kimi K2 Thinking    $1.07
GPT-5 (medium)      $3.44
Claude Opus 4.5     $10.00
Claude 4.1 Opus     $30.00

100x spread between the cheapest and most expensive frontier models. Same ballpark capability.

Energy efficiency has improved dramatically too. UNESCO/UCL research shows small architectural tweaks reducing consumption by up to 90% in some cases. The models aren't just cheaper to run. They're cheaper to train and cheaper to deploy. The barriers to entry continue falling.

Where the problems remain

Bias issues persist. Research continues to surface models showing prejudice against non-standard dialects and demographic groups. The rush to benchmark performance hasn't been matched by equivalent attention to fairness and safety. This isn't new, but it's not improving as fast as capability scores.

Switching costs keep users on closed platforms even when open alternatives are objectively better. MIT's $24.8B waste estimate captures real organizational inertia. Enterprises that integrated GPT-4 into their stacks eighteen months ago aren't eager to rip out that integration just because DeepSeek is cheaper.

The benchmark-versus-vibes gap isn't closing. Models that crush standardized tests can feel clunky in production. The community's jokes about "benchmaxxed" releases point at a real problem: the tests miss how models behave on actual use cases. Until we develop better evaluation methods for real-world performance, this disconnect will persist.

What comes next

Google's Gemini 3 full release is expected before year-end with further coding enhancements. OpenAI continues iterating on the GPT-5 family. The Chinese labs show no signs of slowing their release cadence.

But the interesting question isn't which model tops the leaderboard next. It's whether the leaderboard itself remains meaningful.

The shift is already happening. Attention is moving from MMLU and AIME toward agentic benchmarks like Terminal-Bench and τ²-Bench. The question is changing from "does the model know things" to "can the model do things." Traditional metrics have maybe six months of relevance left. The ones that survive will be the ones that test what humans actually want: not whether the model can ace a test, but whether it can get work done.

For now, the practical guidance is straightforward (a rough routing sketch in code follows the list):

  • High-volume, cost-sensitive work: DeepSeek V3.2 or Kimi K2 Thinking at $0.32-1.07 per million tokens
  • Agentic, tool-calling workflows: Nova 2.0 Pro or Grok 4.1 Fast for best τ²-Bench performance
  • Maximum capability regardless of cost: Gemini 3 Pro Preview or GPT-5.1 (high)
  • Pure reasoning tasks: DeepSeek V3.2-Speciale while it's available (deadline December 15)
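
A minimal version of that routing logic, using the article's recommendations as the lookup table. The strings are the names used in this article, not literal API model identifiers, which vary by provider and should be checked against their docs.

```python
# Rough routing sketch for the guidance above. Model names mirror the
# article's recommendations; real API model ids differ by provider.
ROUTES = {
    "bulk": "DeepSeek V3.2",                 # high-volume, cost-sensitive
    "agentic": "Nova 2.0 Pro",               # tool-calling, multi-step workflows
    "frontier": "Gemini 3 Pro Preview",      # maximum capability, cost no object
    "reasoning": "DeepSeek V3.2-Speciale",   # pure reasoning, while available
}

def choose_model(task_type: str) -> str:
    """Map a coarse task category to the recommended model for it."""
    return ROUTES.get(task_type, ROUTES["bulk"])  # default to the cheap tier

print(choose_model("agentic"))   # Nova 2.0 Pro
print(choose_model("unknown"))   # DeepSeek V3.2
```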

Re-evaluate every 3-6 months. The landscape moves fast enough that today's best recommendation could be obsolete by spring.

The bottom line

We're watching a market in transition. The top three models haven't changed in months, but everything below them is churning. Open-weight models matched proprietary capability at a fraction of the cost. Traditional benchmarks saturated while agentic tests remain unsolved. The next round of competition won't be about what models know. It will be about what they can do.

Data sourced from WhatLLM.org tracking of 114 models across 50+ providers. See our interactive comparison tool for the latest numbers.

Cite this analysis

If you're referencing this data in your work:

Bristot, D. (2025, December 3). The State of LLMs: December 2025. What LLM. https://whatllm.org/blog/state-of-llms-december-2025

Data sources: Artificial Analysis, WhatLLM internal tracking, December 2025