2025 AI year in review: the year intelligence became infrastructure
Twelve months ago, we were debating whether AI could reason. Now we're debating who owns the reasoning.
As 2025 wraps up, the temptation is to frame it as another year of incremental progress, another round of benchmark improvements and product launches. That framing would miss what actually happened. This was the year the ground shifted beneath the entire industry, when assumptions that held for a decade broke in months, and when the economics of intelligence inverted so completely that by December, a phone company was releasing frontier models under MIT license.
The numbers tell part of the story. Corporate AI investment hit $252 billion, up 44 percent from 2024. Private funding for generative AI reached $33.9 billion, and startup formations tripled. Sixty-five percent of enterprises deployed generative AI in production, up from scattered pilots a year earlier. But the numbers obscure the texture of what changed: models that reason rather than merely predict, open weights that match proprietary systems at a fraction of the cost, agents that complete tasks rather than just generate text, and an infrastructure race that saw NVIDIA acquire Groq for $20 billion while Google spent $4.75 billion on data center energy.
This is not a summary of press releases. It's an attempt to distill the year into what actually mattered, the shifts that will compound into 2026 and beyond. Some of what follows is technical. Some is economic. All of it is grounded in the evidence of what shipped, what worked, and what didn't.
The reasoning revolution
If you could only understand one thing about AI in 2025, it would be this: models learned to think, not just predict. The distinction sounds subtle but changes everything. Traditional language models work by predicting the next token in a sequence, a process that emerges from training on vast corpora of text. Reasoning models add a layer on top: they decompose problems into steps, verify intermediate results, and adjust their approach based on what they learn during inference.
The technical foundation for this shift was reinforcement learning with verifiable rewards, or RLVR. Unlike earlier approaches like reinforcement learning from human feedback, which relied on subjective preferences to shape model behavior, RLVR uses objective metrics in domains where success can be automatically evaluated. Mathematics problems have correct answers. Code either runs or it doesn't. Logic puzzles have verifiable solutions. By training against these objective signals, labs produced models that could "think through" multi-step problems rather than pattern-matching to memorized solutions.
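The "verifiable" part is more concrete than it sounds: the reward is an automatic pass/fail check rather than a learned preference model. A minimal sketch of the two most common reward types, purely illustrative and far simpler than any lab's actual pipeline:

```python
import subprocess
import tempfile

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 only if the extracted final answer matches."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(candidate: str, tests: str) -> float:
    """Binary verifiable reward: 1.0 only if the candidate passes its unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # hangs and infinite loops score zero
```

During training, thousands of sampled solutions are scored this way and the policy is nudged toward the ones that score 1.0; no human rater sits in the loop.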
OpenAI's o3 series exemplified the approach. The high variant could sustain autonomous operation for hours, working through complex research or strategic planning tasks that would previously have required human oversight at every step. Anthropic's Claude 4.5 introduced hybrid modes that let developers choose between fast pattern-matching and deliberate reasoning depending on the task. Google's Gemini 3 Pro pushed multimodal reasoning, processing text, images, and video in unified chains of thought.
| Model | Intelligence Index | AIME 2025 | SWE-Bench | Reasoning Mode |
|---|---|---|---|---|
| GPT-5.1 High | 70 | 94% | 70% | Extended thinking |
| Claude 4.5 Sonnet | 63 | 91% | 72% | Hybrid modes |
| Gemini 3 Pro | 73 | 95% | 65% | Multimodal chains |
| Grok 4 | 65 | 89% | 68% | Real-time thinking |
| o3 | 65 | 93% | 71% | Configurable depth |
The practical impact showed in benchmarks and production alike. On frontier math evaluations, resolution rates exceeded 80 percent. On SWE-Bench, select variants surpassed 90 percent accuracy. But perhaps more telling were the deployment stories: McKinsey reported 20-30 percent productivity gains in engineering and research workflows, with AI handling not just generation but iterative debugging and optimization. In drug discovery, AI contributed to 15 percent faster compound identification. In materials science, simulations that would have taken months ran in days.
Andrej Karpathy captured the nuance in his year-end review: these systems exhibit "jagged intelligence," superhuman in some domains while surprisingly brittle in others. A model that solves olympiad mathematics can fail at basic spatial reasoning. One that writes production code struggles with tasks requiring common sense. The intelligence is real but uneven, which means the value comes from knowing where to apply it and where human oversight remains essential.
Test-time compute: the new scaling law
Traditional scaling laws said: double parameters and data, get predictable capability gains. Those returns are diminishing. The new scaling dimension is test-time compute, spending more inference cycles on harder problems. Models like DeepSeek V3.2 use up to 10x more inference resources for complex tasks while remaining efficient on simple ones. This means capability can scale without proportional training costs, fundamentally changing the economics of AI development.
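Mechanically, test-time scaling can be as simple as sampling several answers at a small thinking budget and escalating only when they disagree. A minimal sketch, where `generate` is a hypothetical stand-in for a model call that accepts a thinking-token budget:

```python
from collections import Counter
from typing import Callable

def solve_with_escalation(
    prompt: str,
    generate: Callable[[str, int], str],  # hypothetical: (prompt, budget) -> answer
    budgets: tuple = (1_000, 10_000, 100_000),
    samples: int = 5,
    agreement: float = 0.8,
) -> str:
    """Spend more inference compute only when the model is uncertain."""
    best = ""
    for budget in budgets:
        answers = [generate(prompt, budget) for _ in range(samples)]
        best, count = Counter(answers).most_common(1)[0]
        if count / samples >= agreement:
            break  # answers agree: no need to think harder
    return best
```

The point of the pattern is the cost profile: easy queries exit at the small budget, and only genuinely hard ones trigger the 100x escalation.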
Open source reaches parity
The most consequential shift of 2025 wasn't a single model release. It was the accumulation of evidence that open-weight models could match proprietary systems across nearly every metric that matters, at a fraction of the cost. This happened faster than anyone predicted, driven primarily by Chinese labs operating with different incentives and constraints than their American counterparts.
China filed over 700 generative AI models in 2025, representing nearly half of global releases. But volume alone doesn't explain the impact. DeepSeek V3.2, with its 671 billion total parameters and 37 billion active through sparse Mixture-of-Experts routing, scored 96 percent on AIME 2025, surpassing GPT-5's 94 percent. Z.ai's GLM-4.7 hit 95.7 percent on the same benchmark while excelling in multilingual agentic coding. Xiaomi's MiMo-V2-Flash topped open-source charts in software engineering at 71.7 percent on SWE-Bench Multilingual.
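The "671 billion total, 37 billion active" arithmetic comes from top-k expert routing: each token is dispatched to only a handful of expert subnetworks, so most parameters sit idle on any given forward pass. A toy sketch of the mechanism (real architectures add shared experts, load-balancing losses, and much else):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Sparse Mixture-of-Experts layer: route a token to its top-k experts."""
    logits = x @ router_w                            # one routing score per expert
    top = np.argsort(logits)[-k:]                    # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                             # softmax over selected experts only
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Toy usage: 16 experts, each a small linear map; only 2 run per token
rng = np.random.default_rng(0)
d = 64
experts = [(lambda x, W=rng.standard_normal((d, d)) / d**0.5: x @ W) for _ in range(16)]
router_w = rng.standard_normal((d, 16)) / d**0.5
y = moe_forward(rng.standard_normal(d), router_w, experts, k=2)
```

Here only an eighth of the expert parameters are touched per token; scale the same idea up and you get capacity measured in hundreds of billions of parameters at a fraction of the per-token compute.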
These models weren't just competitive on benchmarks. They were radically cheaper. DeepSeek V3.2 costs $0.028 per million input tokens. Running a full evaluation suite costs around $54; the same evaluation on GPT-5 would cost over $300, and on Claude Opus 4.5, closer to $800. When the Stanford AI Index analyzed 114 models, it found open weights matching proprietary systems across quality metrics at roughly 10x lower inference cost.
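The arithmetic behind those figures is linear in token volume, which is why price gaps compound so visibly. A back-of-envelope sketch; the implied token volume and the two comparison prices are inferred to match the figures above, not published numbers:

```python
DEEPSEEK_PRICE = 0.028   # $/M input tokens, from the text
SUITE_COST = 54.0        # $ for the full evaluation suite, from the text

suite_tokens_m = SUITE_COST / DEEPSEEK_PRICE   # implied volume: ~1,929M input tokens

for name, price in [("GPT-5 (assumed ~$0.16/M)", 0.16),
                    ("Claude Opus 4.5 (assumed ~$0.42/M)", 0.42)]:
    print(f"{name}: ${suite_tokens_m * price:,.0f}")
# -> ~$309 and ~$810, consistent with "over $300" and "closer to $800"
```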
Open-source leaders by December 2025
- DeepSeek V3.2: 96% AIME, 68% SWE-Bench, $0.10/M tokens
- GLM-4.7: 95.7% AIME, 200K context, MIT license
- MiMo-V2-Flash: 71.7% SWE-Bench Multilingual, embodied AI
- Kimi K2 Thinking: Intelligence Index 67, agentic focus
- Qwen3-235B: Intelligence Index 57, $0.25/M tokens
The cost revolution
- 280x: Inference cost reduction since 2022
- 10-20x: Open vs proprietary cost advantage
- $5.57M: DeepSeek V3.2 training cost
- 30%: Chinese models' global token share
- GPT-3.5 parity: Now essentially free
The strategic implications rippled through the industry. American labs that had relied on proprietary moats found those moats eroding. OpenAI released lighter open-weight variants like GPT-oss defensively. NVIDIA's Nemotron 3 family added reinforcement learning libraries for local deployment. The competitive response was real, but it couldn't change the fundamental dynamic: open weights had achieved parity, and the trajectory suggested they would maintain or extend that position.
For enterprises, this created new options. Indie teams fine-tuned open models for niche applications, from personalized education to supply chain forecasting, without cloud dependencies. Larger organizations began shifting workloads from proprietary APIs to self-hosted deployments where the economics favored it. The MIT study estimating $24.8 billion in "wasted spending" on closed models captured the tension: switching costs kept many organizations on proprietary systems even when alternatives were objectively better.
The risks were real too. Synthetic data proliferated through training pipelines, raising concerns about model collapse, where outputs grow progressively more repetitive. Karpathy's term "slop" captured the phenomenon: AI-generated text that was technically correct but lacked originality or insight. For 2026, the lesson is clear: leverage open weights for cost and customization, but invest in proprietary data pipelines to maintain differentiation.
Agents go to work
The year's most practical advancement was the transition from assistants to agents, from systems that generate responses to systems that complete tasks. This sounds like marketing language but reflects a genuine capability shift. Models in 2025 could maintain state across sessions, call external tools, verify their own outputs, and recover from errors without human intervention.
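Stripped to its skeleton, nearly every 2025 agent stack is a variant of the same loop: the model proposes an action, the runtime executes it, and the observation, including any error, goes back into context. A minimal sketch with hypothetical `llm` and `tools` stand-ins:

```python
def run_agent(task: str, llm, tools: dict, max_steps: int = 20) -> str:
    """Minimal agent loop: propose an action, execute it, observe, repeat.

    `llm` is a hypothetical model call mapping a transcript to an action
    dict such as {"tool": "run_tests", "args": {...}} or {"done": "answer"}.
    `tools` maps tool names to ordinary Python callables.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(history)
        if "done" in action:
            return action["done"]
        try:
            observation = tools[action["tool"]](**action["args"])
        except Exception as err:
            observation = f"error: {err}"  # feed failures back so the model can recover
        history.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("agent exceeded its step budget")
```

Statefulness, tool use, self-verification, and error recovery all reduce to how well the model chooses the next action given that accumulated history.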
McKinsey's survey found 65 percent of enterprises had deployed generative AI, with projections of $51 billion in agentic spend by 2028. The applications spanned industries: in logistics, agents optimized routes with 15-20 percent efficiency gains. In healthcare, triage systems achieved 90 percent accuracy. In software development, end-to-end coding agents handled projects from specification to deployment.
The tooling ecosystem matured to support this. Anthropic's Bloom framework open-sourced behavioral evaluations, enabling real-environment testing of agent capabilities. LangGraph provided stateful workflow management. CrewAI enabled multi-agent coordination with role-based team structures. These weren't research prototypes. They were production systems handling real workloads.
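The frameworks differ in ergonomics, but the coordination pattern they share is role-based handoff. A generic sketch of the idea in plain Python, not any framework's actual API; `llm` is again a hypothetical model call:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    role: str
    llm: Callable[[str, str], str]  # hypothetical: (role, prompt) -> completion

    def run(self, prompt: str) -> str:
        return self.llm(self.role, prompt)

def run_crew(task: str, llm: Callable[[str, str], str]) -> str:
    """Role-based handoff: a planner decomposes, workers execute, a reviewer checks."""
    planner = Agent("planner", llm)
    worker = Agent("worker", llm)
    reviewer = Agent("reviewer", llm)

    plan = planner.run(f"Break this task into numbered steps:\n{task}")
    results = [worker.run(step) for step in plan.splitlines() if step.strip()]
    return reviewer.run("Verify and merge these step results:\n" + "\n".join(results))
```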
Agentic AI in production: where it worked
- Software development: IDE extensions like GitHub Copilot and Cursor; dedicated environments like Zed; cloud agents like Devin handling full project lifecycles.
- Customer support: Sierra ($10B valuation) and Decagon deploying "brand ambassadors" with multi-LLM supervision for complex resolution.
- Sales automation: Salesforce Agentforce SDR and 11x's Alice autonomously prospecting and booking meetings with minimal input.
- Browser automation: Claude for Chrome and ChatGPT Agent handling web tasks via vision LLMs with error recovery.
Yet the limits were equally instructive. MIT studies found productivity boosts of 20-40 percent confined to narrow, well-defined tasks. When problems became unstructured or required genuine creativity, agents struggled. Eighty-four percent of projects underdelivered due to brittleness, with agents that worked perfectly in demos failing when edge cases appeared in production. Apple research noted "complete accuracy collapse" in complex scenarios where errors compounded.
Gartner predicts 80 percent autonomous resolution of customer issues by 2029, but 2025 taught a more nuanced lesson: agents augment rather than replace. They multiply expert output in structured domains but require human loops for escalation and oversight. The organizations that succeeded treated agents as force multipliers for skilled workers, not substitutes for workforce development.
The infrastructure race intensifies
Behind the model releases and capability improvements, a different story played out: the scramble for compute, energy, and silicon that will determine who can build and deploy AI at scale. Corporate AI investment hit $252 billion, up 44 percent. Four companies committed $364 billion to data centers and chips. NVIDIA's $20 billion Groq acquisition signaled that inference optimization, not just training capability, had become strategic.
The hardware advances were substantial. NVIDIA's Blackwell B200 systems, widely available by Q3, delivered 3x throughput over H200 at 39,000 versus 13,000 tokens per second under load. Rack-scale NVL72 configurations, packing 72 GB200 chips, hit production for the largest deployments. Challengers emerged: Cerebras's Wafer Scale Engine 3 reached 125 PFLOPs; Groq's LPU architecture demonstrated extreme inference efficiency with 230 MB of on-chip SRAM eliminating memory bottlenecks.
But constraints emerged in parallel. Inference cost per token fell 280x from 2022 levels, yet usage grew 31x; cheaper inference induced more consumption, so aggregate demand for compute and power kept rising. Projections suggest AI energy consumption could match India's by 2030 if trends continue. Data centers take three years to bring online, creating structural lags between investment and capacity. The race became not just about building the best chips but about securing the power to run them.
| Deal | Value | Strategic focus |
|---|---|---|
| NVIDIA acquires Groq | $20B | Inference-optimized silicon |
| Google acquires Intersect Power | $4.75B | Data center energy |
| Salesforce acquires Informatica | $8B | Data integration for AI |
| Meta increases Scale AI stake | $14-15B | Training data operations |
Energy solutions emerged at various scales. Google's Intersect Power acquisition addressed near-term needs. Longer-term, micro-nuclear deployments from companies like Oklo promised carbon-free baseload. Most ambitiously, orbital solar clusters, inspired by China's constellation programs, offered theoretically unlimited power at scale, with early deployments demonstrating feasibility if not yet economics.
For organizations building on AI, the infrastructure lesson of 2025 is sobering: access to compute is as important as algorithmic innovation. Those who secured capacity early gained advantages that late entrants couldn't match. The next wave of AI leaders may be determined less by research brilliance than by infrastructure positioning.
Multimodal becomes mainstream
Text dominated AI's early development, but 2025 saw genuine multimodal capability reach production quality. Models didn't just process different input types; they reasoned across them, understanding images in context with text, generating video that matched audio, and handling real-time speech in conversational flows.
Google and OpenAI led the push. Gemini 2.5 Flash supported text, image, video, and speech inputs with corresponding outputs. GPT-5 variants processed images natively within reasoning chains. For video generation, OpenAI's Sora 2 and Google's Veo 3 generated synchronized audio natively, at $0.50 and $0.40 per 1080p second respectively. Quality improved dramatically: Kling 2.5 Turbo topped both text-to-video and image-to-video leaderboards, and open weights like Alibaba's Wan 2.2 A14B reached competitive rankings.
Image editing advanced from generation to precise control. Gemini 2.5 Flash and GPT Image 1 enabled multi-image inputs for iterative refinement. Smaller labs pushed specific capabilities: Runway's Aleph specialized in video editing workflows; ElevenLabs drove voice agent deployment for customer support through companies like PolyAI.
The practical applications multiplied. Design workflows incorporated AI for ideation through production. Video production teams used generation for rough cuts and effects that would previously require extensive manual work. Voice interfaces moved from novelty to production deployment in customer-facing applications. The convergence of modalities into unified model architectures suggested that future AI systems would naturally handle whatever input types a problem required.
Global dynamics and the policy response
The US-China divide sharpened throughout the year, with implications that extended beyond technology into economics and geopolitics. America's $109 billion in private AI investment dwarfed China's $9.3 billion, but Chinese labs nearly closed the performance gap through sheer volume and efficiency innovations. Open models from Chinese labs captured 30 percent of global token usage by year-end, up from 13 percent.
Policy responses varied by region. The US Executive Order of December 11 established a national framework aimed at balancing innovation promotion with oversight, explicitly evaluating state laws for potential overreach. The EU AI Act took effect in August, with detailed guidelines for general-purpose models and tiered requirements based on risk levels. China took a different approach, removing a comprehensive AI law from its legislative agenda in favor of targeted pilots and standards development that allowed faster iteration.
Export controls created new friction points. The April rollback of H20 chip bans followed by renewed restrictions created uncertainty for multinational operations. Huawei's development of NVL72-competitive accelerators on TSMC and SMIC nodes demonstrated that restrictions accelerated domestic capability development rather than preventing it. For organizations operating globally, navigating these dynamics became a core competency rather than a peripheral concern.
[Chart: Global AI sentiment and trust. Source: Stanford AI Index 2025, Pew Research surveys]
The reckoning: where hype met reality
For all the genuine progress, 2025 also delivered a correction to inflated expectations. MIT research found 95 percent of generative AI pilots failed to reach production, citing brittle workflows, data quality issues, and misaligned expectations about what AI could actually do. Eighty-four percent of projects that did deploy underdelivered on original goals, with privacy, bias, and integration challenges more difficult than anticipated.
The failure patterns were instructive. Organizations that treated AI as a drop-in replacement for human workers consistently struggled. Those that redesigned workflows around AI capabilities, accepting its strengths and limitations, fared better. The most successful deployments focused on augmentation: AI handling structured, repetitive elements while humans managed exceptions, creativity, and judgment calls.
Specific failures drew attention. Advanced models showed "complete accuracy collapse" in complex scenarios according to Apple research, where small errors in early reasoning steps cascaded into completely wrong conclusions. Stanford warned about biases in mental health applications that risked dangerous outcomes for vulnerable users. Misinformation generated by AI systems caused real-world harm in multiple documented incidents.
Karpathy's metaphor of AI as "jagged ghosts," brilliantly capable in some domains while surprisingly hollow in others, captured the essential truth that 2025 revealed. The technology is powerful but not uniformly so. Knowing where to apply it, and more importantly where not to, became the differentiating skill for organizations deploying AI at scale.
Scientific AI: from simulation to discovery
Beyond commercial applications, 2025 saw AI transform scientific research across disciplines. The Stanford AI Index documented contributions to 43 percent of materials science papers, up from negligible levels just three years ago. In drug discovery, AI-assisted compound identification accelerated by 15-20 percent, with multiple candidates entering clinical trials that emerged from computational screening.
Climate modeling incorporated AI at scale, with hybrid approaches combining physics-based simulation and machine learning producing forecasts at resolutions previously computationally infeasible. Protein structure prediction, catalyzed by earlier work like AlphaFold, expanded to protein-protein interactions and dynamic behavior, opening new avenues for therapeutic development.
The pattern across domains was consistent: AI didn't replace scientific expertise but extended it, enabling researchers to explore vastly larger hypothesis spaces than human effort alone could manage. The most productive collaborations treated AI as a tool for hypothesis generation and initial screening, with human researchers providing domain knowledge, experimental design, and validation.
What 2026 looks like from here
Predictions are hazardous in a field moving this fast, but the trends of 2025 point toward specific developments in the year ahead. Agents will become reliable for a broader range of tasks, with multi-agent orchestration enabling complex workflows that require coordination. Embodied AI, from Tesla's Optimus to industrial robotics, will scale from hundreds to thousands of deployed units. The distinction between reasoning models and traditional LLMs will blur as hybrid approaches become standard.
On infrastructure, expect continued consolidation as the cost of competing at the frontier rises. Smaller labs will increasingly license or fine-tune open models rather than training from scratch. The provider layer, companies that host and serve models rather than develop them, will capture more value as model weights commoditize.
Geopolitically, fragmentation may accelerate. Export controls and data sovereignty requirements could create distinct AI ecosystems with limited interoperability. The open-source buffer, where MIT-licensed models flow across borders regardless of policy, may be the primary force preventing complete separation.
The essential lessons of 2025
- Reasoning is the new scaling law. Test-time compute matters more than parameter counts.
- Open source achieved parity. Proprietary advantages are narrow and expensive to maintain.
- Agents augment, not replace. The value is in multiplying human expertise, not eliminating it.
- Infrastructure is strategy. Access to compute and energy determines what's possible.
- Intelligence is jagged. Knowing where AI fails matters as much as knowing where it succeeds.
As 2025 closes, AI stands at a threshold. The technology that was experimental novelty a decade ago now runs production workloads across industries. The companies that dominated through proprietary advantages face open-source competitors matching their capabilities at a fraction of the cost. The infrastructure to support AI at scale strains against energy and manufacturing limits. And the organizations deploying AI grapple with the gap between what demos promise and what production delivers.
The year's overarching lesson may be the most important: AI's value lies not in replacement but in augmentation. The systems that worked multiplied human capability rather than substituting for it. The deployments that failed tried to automate what still requires judgment, creativity, and contextual understanding. The organizations that succeeded found the intersection where AI's strengths met human needs.
Those who build thoughtfully in 2026, balancing capability with responsibility, leveraging open models where appropriate while investing in differentiation, treating AI as infrastructure rather than magic, will shape what comes next. The intelligence economy is here. The question is who will thrive within it.
Data sourced from Stanford AI Index 2025, Artificial Analysis State of AI Q3 2025, McKinsey surveys, MIT research, and WhatLLM.org tracking. See our interactive comparison tool for the latest model data.
Cite this analysis
If you're referencing this content in your work:
Bristot, D. (2025, December 26). 2025 AI year in review: The year intelligence became infrastructure. What LLM. https://whatllm.org/blog/ai-2025-year-in-review
Sources: Stanford HAI, Artificial Analysis, McKinsey Global Survey, December 2025