The state of AI at the end of 2025: reasoning won, agents arrived, and the race got tighter
A synthesis of Artificial Analysis's 2025 Year-End State of AI report: what changed, what it means, and what to watch in 2026.
2025 in five numbers
If you read a "state of AI" piece a year ago, you probably saw the same caveat repeated everywhere: progress is slowing, scaling is hitting a wall, the easy gains are over.
Twelve months later, that take has aged like milk.
At the start of 2025, OpenAI's o1 was the only "reasoning" model on the market and coding agents barely existed. By the end of the year, every major lab had reasoning models occupying the top of every intelligence leaderboard, software engineers were instructing autonomous agents that work for minutes at a time, and the price of GPT-4 level intelligence had collapsed by roughly 100x.
Artificial Analysis just published their 2025 Year-End State of AI report, a 34-page synthesis of frontier model benchmarks, market structure, hardware shifts, and the modalities that broke through. Below is my read of what mattered, organized around the six trends that actually shaped the year.
1. Reasoning models became the status quo
This is the single most important shift of 2025, and it happened faster than almost anyone predicted.
The Artificial Analysis Intelligence Index v4.0, which aggregates ten evaluations including GPQA Diamond, Humanity's Last Exam, τ²-Bench Telecom, Terminal-Bench Hard, and SciCode, tells the story cleanly. The top of the leaderboard at the end of 2025:
| Rank | Model | Lab | Index |
|---|---|---|---|
| 1 | GPT-5.2 (xhigh) | OpenAI | 51 |
| 2 | Claude Opus 4.5 | Anthropic | 50 |
| 3 | GPT-5.2 Codex (xhigh) | OpenAI | 49 |
| 4 | Gemini 3 Pro Preview (high) | Google | 48 |
| 5 | Kimi K2.5 | Moonshot AI | 47 |
| 6 | Gemini 3 Flash | Google | 46 |
| 7 | Claude 4.5 Sonnet | Anthropic | 43 |
| 8 | GLM-4.7 | Z.ai | 42 |
| 9 | DeepSeek V3.2 | DeepSeek | 42 |
| 10 | Grok 4.1 | xAI | 41 |
Every model in the top ten is a reasoning model. Twelve months ago, only one was.
The mechanism is simple but transformative: models now spend output tokens "thinking" before they answer. That delivered measurable gains on general reasoning, scientific reasoning, long-horizon agentic tasks, and coding, but it also expanded the average workload size dramatically, since reasoning models generate roughly 10x more output tokens per query.
OpenAI began and ended 2025 with the most capable language model, but their lead is narrower than ever. Anthropic, Google, xAI, and a growing roster of Chinese labs (Moonshot, Z.ai, DeepSeek, MiniMax) are all credibly competitive at the frontier.
2. The cost of intelligence collapsed, even as compute demand exploded
Here's the paradox that defined 2025: per-token costs fell faster than at any point in the field's history, and total compute demand accelerated anyway.
The price per token for o1-level intelligence, which sat around $32 per million tokens at the start of 2025, fell by a factor of roughly 128 by year-end. GPT-4 level intelligence is now roughly 100x cheaper than the original GPT-4. The Pareto frontier of intelligence-vs-cost shifted left and up so dramatically that the "most attractive quadrant" of the chart is now densely populated with options that didn't exist twelve months ago.
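The headline price drop is easy to sanity-check. A quick back-of-envelope calculation using only the two figures quoted above (the $32 starting price and the 128x reduction):

```python
# Back-of-envelope check on the report's headline cost figures.
# Both inputs are the numbers quoted in the text, nothing else.

start_price = 32.00     # $/1M tokens for o1-level intelligence, early 2025
drop_factor = 128       # reported year-over-year price reduction

end_price = start_price / drop_factor
print(f"End-of-2025 price for o1-level intelligence: ${end_price:.2f} per 1M tokens")
```

That works out to about 25 cents per million tokens, which is why the "most attractive quadrant" of the intelligence-vs-cost chart filled up so quickly.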
Three forces that drove cost down
- Smaller models: sparsity, algorithmic, and training-data improvements (~10x less compute)
- Software efficiency: FlashAttention, vLLM, SGLang, TensorRT-LLM (~3x less compute)
- Hardware efficiency: Blackwell-class accelerators (~3x lower cost)
Three forces that pushed demand up
- Larger frontier models: scaling still demands more parameters at the absolute frontier (~5x compute per query)
- Reasoning models: significant token expansion when models "think" (~10x output tokens)
- AI agents: multi-step workflows chain dozens of LLM calls (~20x requests per use case)
A single deep-research query in late 2025 can consume more compute than 10 original GPT-4 queries combined. Cheaper per token, vastly more tokens, larger total bill.
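Multiplying the rough factors above shows why both statements can be true at once. A toy calculation, using only the illustrative multipliers quoted in the lists (these are order-of-magnitude figures, not measured values):

```python
# Toy model of the 2025 cost paradox, using the rough multipliers above.
# All factors are illustrative order-of-magnitude figures, not measurements.

# Forces pushing per-token cost DOWN (multiplicative reductions)
model_efficiency = 1 / 10      # smaller, sparser models
software_efficiency = 1 / 3    # FlashAttention, vLLM, SGLang, TensorRT-LLM
hardware_efficiency = 1 / 3    # Blackwell-class accelerators
cost_per_token = model_efficiency * software_efficiency * hardware_efficiency

# Forces pushing compute demand UP (per end-user task)
larger_models = 5              # more parameters at the frontier
reasoning_tokens = 10          # "thinking" expands output token counts
agent_requests = 20            # agentic workflows chain many LLM calls
compute_per_task = larger_models * reasoning_tokens * agent_requests

net = cost_per_token * compute_per_task
print(f"Per-token cost: ~{cost_per_token:.3f}x the baseline")
print(f"Compute per task: ~{compute_per_task}x the baseline")
print(f"Net spend per task: ~{net:.0f}x the baseline")
```

Per-token cost drops to roughly 1% of the baseline, but demand per task grows ~1,000x, so total spend per task still rises by an order of magnitude. Cheaper per token, vastly more tokens, larger total bill.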
3. Coding agents went from idea to indispensable
At the start of 2025, software engineering meant copy-pasting code into ChatGPT or Cursor Chat. By year-end, the workflow had inverted: developers describe a goal, an agent works autonomously for minutes (sometimes longer), and engineers review the diff.
Tool calling is now baked into pre-training and reinforcement learning for nearly every model released in 2025. Long-horizon coding tasks were the biggest beneficiaries, and the proliferation of coding agents (Claude Code, Codex, Cursor agents, Cowork, and dozens of smaller players) has made the agentic loop the default workflow rather than a novelty.
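The agentic loop itself is conceptually simple: the model proposes an action, the harness executes it, and the result is fed back until the model declares it is done. A minimal sketch with a stubbed model and a toy tool registry standing in for a real LLM API (all names here are hypothetical, not any particular vendor's interface):

```python
# Minimal sketch of an agentic tool-calling loop. The "model" is a stub
# that emits scripted actions; a real agent would call an LLM API here.

def fake_model(history):
    """Stand-in for an LLM: picks the next action from the transcript."""
    if not any(msg["role"] == "tool" for msg in history):
        return {"type": "tool_call", "tool": "run_tests", "args": {}}
    return {"type": "final", "content": "All tests pass; here is the diff."}

TOOLS = {
    "run_tests": lambda **kw: "4 passed, 0 failed",  # toy tool implementation
}

def agent_loop(task, model=fake_model, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "final":       # model is done: return its answer
            return action["content"]
        result = TOOLS[action["tool"]](**action["args"])  # execute the tool
        history.append({"role": "tool", "content": result})
    raise RuntimeError("agent exceeded step budget")

print(agent_loop("Fix the failing test in utils.py"))
```

The `max_steps` budget is the crude ancestor of the token-efficiency question below: every extra loop iteration is another model call and another batch of output tokens.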
Token efficiency is the new differentiator
A subtle but important finding from the report's GDPval-AA benchmark: in agentic workflows, more output tokens don't translate to higher intelligence. What matters is using tools effectively. Google's and Anthropic's leading models sit on the Pareto frontier for token efficiency on long-horizon agentic tasks, while several models burn through far more tokens to reach lower scores. Efficiency of tool use is becoming a real axis of differentiation.
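"On the Pareto frontier" has a precise meaning here: a model is on the frontier if no other model scores at least as high while spending no more tokens. A short sketch of that check, using made-up placeholder numbers rather than figures from the report:

```python
# Computing a token-efficiency Pareto frontier: a model is on the frontier
# if no other model dominates it (scores >= while using <= tokens, with at
# least one strict improvement). Scores below are invented placeholders.

models = {
    "model_a": {"tokens_m": 20, "score": 48},
    "model_b": {"tokens_m": 35, "score": 50},
    "model_c": {"tokens_m": 60, "score": 44},  # more tokens, lower score
    "model_d": {"tokens_m": 15, "score": 40},
}

def pareto_frontier(models):
    frontier = []
    for name, m in models.items():
        dominated = any(
            o["score"] >= m["score"] and o["tokens_m"] <= m["tokens_m"]
            and (o["score"] > m["score"] or o["tokens_m"] < m["tokens_m"])
            for other, o in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

print(pareto_frontier(models))  # model_c burns more tokens for a lower score
```

In this toy data, `model_c` is the "burns far more tokens for a lower score" case: it is dominated and falls off the frontier, while the cheap-but-weaker `model_d` stays on it.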
The forecast for 2026 from Artificial Analysis: coding agents are the proof of concept. Next year is "agents for everything else."
4. The race got more contested, not less
A common 2024 prediction was that the AI race would consolidate around two or three labs. The opposite happened.
The Intelligence Index leaderboard at year-end shows credible frontier-class models from at least eight distinct organizations, with strong reasoning models from Chinese labs (Moonshot, Z.ai, DeepSeek, MiniMax, Alibaba), Korean labs (LG AI Research, SK Telecom, Naver, Upstage), and a steady drumbeat of releases from Mistral and others.
China
Beijing has emerged as the densest concentration of frontier AI startup activity globally: ByteDance, Zhipu, Moonshot, Xiaomi, Kuaishou, iFlytek, Meituan, Baidu, and Kunlun all cluster there. Shanghai, Hangzhou (Alibaba, DeepSeek), and Shenzhen (Tencent, Huawei, DJI) round out a remarkably distributed ecosystem.
Korea
The government-backed Sovereign AI Initiative, a multi-stage national competition with direct funding and GPU access for winners, produced multiple near-frontier labs. LG AI Research's K-EXAONE (236B open-weights reasoning), SK Telecom's A.X K1 (500B open-weights), Naver's HyperCLOVA X SEED Think, and Upstage's Solar Open all shipped in 2025.
Open weights
The open-to-proprietary intelligence gap held roughly steady through the year, but the composition shifted: the most capable open-weights models (Kimi K2.5, Kimi K2 Thinking, gpt-oss-120B, DeepSeek V3.2) are predominantly from Chinese labs. OpenAI's first open-weights model since GPT-2, gpt-oss-120B, pushed the open frontier meaningfully forward, but proprietary models still lead the absolute frontier.
Meta & vertical integration
The report flags one notable absence: Meta hasn't released a new model since Llama 4 Maverick in April 2025, and has restructured its AI efforts. Google remains the most vertically integrated player: TPUs at the bottom, Gemini at the top, Search and Workspace as surface area. Anthropic signed multi-year deals with both Google (TPU) and Amazon (Trainium) for training and inference in 2025.
5. Image, video, and voice broke through to mainstream
The headline modalities all crossed real quality thresholds in 2025.
Image. GPT Image 1.5 ended 2025 roughly 150 ELO points ahead of FLUX1.1 [pro] Ultra (the leader at the end of 2024). Image editing (instruction-based, multi-image, character-consistent) was arguably a bigger story than text-to-image generation. OpenAI's GPT-4o Image and Google's Nano Banana (Gemini 2.5 Flash) drove a step-change in usage and mindshare.
Video. Runway Gen-4.5 ended 2025 about 200 ELO points ahead of OpenAI's Sora (the year-ago leader). Image-to-video, with character references that hold across shots, drove real consumer adoption. Veo 3 (May 2025) was the first mainstream model to natively generate audio with video; Sora 2, LTX-2, Wan 2.6, and Seedance 1.5 Pro followed quickly.
Speech. Native speech-to-speech models matured into the foundation for voice agents. xAI took the lead on the Big Bench Audio benchmark while delivering fast inference; AWS Nova 2.0 Sonic emerged as the price-performance leader. ElevenLabs' Scribe v2 Realtime and NVIDIA's Parakeet Realtime pushed STT into ultra-low-latency territory. Voice agents now reach near-human quality in structured interactions, though ambiguous, multi-turn, and noisy conditions remain unsolved.
China-US parity in media generation. Unlike language models, where the US still leads the absolute frontier, image and video generation is genuinely at parity. ByteDance's Seedream 4.5 competes head-to-head with Nano Banana Pro and GPT Image 1.5; Kling 2.5 Turbo is competitive with Veo 3.1 and Runway Gen-4.5.
6. The hardware story: Blackwell shipped, NVIDIA bought Groq
AI infrastructure had its biggest year of physical change since the Hopper rollout.
Blackwell (B200 and GB200 NVL72 rack-scale systems) reached full production in 2025. IBM's Granite 4 series and OpenAI's GPT-5.3 Codex were among the first frontier models explicitly disclosed as trained on the new generation. NVIDIA followed with the Q3 announcement of B300/GB300: 288 GB of HBM3e (up 50% over B200) and 14 PFLOPS of FP4 compute (versus 9 on B200), shipping later. Software support, especially TensorRT-LLM, matured to the point that Blackwell now leads Hopper across the full Pareto frontier of inference performance.
Inference software consolidated around three open frameworks: vLLM, SGLang, and NVIDIA TensorRT-LLM. Distributed inference techniques previously confined to frontier labs (prefill/decode disaggregation, expert parallelism across hundreds of GPUs, scaled expert replicas for load balancing) are becoming widely accessible heading into 2026.
| Move | Detail |
|---|---|
| NVIDIA acquires Groq | ~$20B in December 2025, structured as IP licensing plus acqui-hire, folding LPU technology into the NVIDIA stack. |
| TPU v6 (Trillium) GA | Powered training of Gemini 2.5 Pro and Gemini 3 Pro, validating the TPU strategy at the frontier. |
| OpenAI inference deal with Cerebras | Joins NVIDIA, AMD, and Broadcom as multi-year inference partners. |
| Intel kills Falcon Shores | Gaudi 3 successor will not ship; no clear timeline for Jaguar Shores. |
| Meta acquires Rivos | Adds custom-silicon capability inside Meta's restructured AI org. |
NVIDIA still dominates frontier-class training. But for the first time, the inference market has multiple credible alternatives across hyperscalers (Google, AWS), challengers (SambaNova, Cerebras, Furiosa), and emerging players (Tenstorrent, Etched, MatX, Achronix, Positron).
What to watch in 2026
A few predictions worth tracking, partly mine and partly the report's:
- Agents for everything. Coding was the proving ground. Customer support, research, finance ops, and back-office workflows are next. The bar will be reliability over raw capability.
- Context engineering replaces prompt engineering. As agents take on multi-step work, structuring context, tool definitions, memory, and feedback loops will dominate over clever prompt phrasing.
- Open weights consolidates around Chinese leadership. Unless a US lab decides to release at the frontier, the most capable open models will continue to come from Beijing and Hangzhou.
- Hallucination becomes the reliability story. The report shows hallucination rate is not tightly correlated with model size; it's a training and post-training decision. Expect labs to compete on this axis explicitly.
- Inference economics gets weird. Distributed inference, disaggregation, and per-customer hardware specialization will start showing up in pricing tiers. Don't be surprised when the cheapest token isn't from the lab that trained the model.
- Voice agents reach broad deployment. Native S2S models cleared the technical bar in 2025; 2026 is the deployment year.
The throughline across all of this: the AI industry didn't slow down, didn't consolidate, and didn't run out of room to scale. It got cheaper, faster, more contested, and more useful. The interesting question for 2026 isn't whether the frontier keeps moving; it's whether the rest of the economy can keep up with what's already shipped.
Source: Artificial Analysis State of AI: 2025 Year-End Edition. All benchmark figures, ELO scores, and Intelligence Index values are from the Artificial Analysis Intelligence Index v4.0 and associated leaderboards as of end-2025 / January 2026. Cross-reference the latest model rankings with our interactive comparison tool.
Cite this analysis
If you're referencing this content in your work:
Bristot, D. (2026, May 5). The state of AI at the end of 2025: reasoning won, agents arrived, and the race got tighter. WhatLLM.org. https://whatllm.org/blog/state-of-ai-2025-year-end