Blog

Benchmark analysis, model rankings, and practical guides for builders.

Analysis · Jun 16, 2026 · 13 min read

Cost per Task Is the New Agentic AI Model Benchmark

Artificial Analysis Intelligence Index v4.1 changes the model-selection question from "which model is smartest?" to "what does a completed agent task cost?" A data-backed guide to the Agentic Index, cost per task, cache pricing, Terminal-Bench 2.1, τ³-Bench Banking, and GDPval-AA v2.

All Articles

New AI Models May 2026: The Frontier Took a Breath, Architecture Took the Stage

After April broke the ceiling (GPT-5.5 at 60.24, Opus 4.7, DeepSeek V4, Kimi K2.6), May went quiet on scale and loud on architecture. SubQ shipped the first commercial subquadratic LLM with a 12M context. Zyphra dropped an 8B MoE trained on AMD. OpenAI made GPT-5.5 Instant the new ChatGPT default.

Monthly Briefing · May 13, 2026

The world is built out of a few narrow places

A long letter on why I built AI Bottlenecks. EUV machines, indium phosphide wafers, CoWoS slots, gas turbines, and the four physical chokepoints gating the 2026 to 2030 AI buildout. The constraint is never at the loud end of the stack.

Essay · May 5, 2026

The state of AI at the end of 2025: reasoning won, agents arrived, and the race got tighter

A synthesis of Artificial Analysis's 2025 Year-End State of AI report. Reasoning models took the leaderboard, the cost of intelligence collapsed 100x, coding agents went mainstream, and the frontier got more contested — not less.

Analysis · May 5, 2026

DeepSeek V4 is here: the open model that made Jensen Huang's 'horrible outcome' real

DeepSeek V4-Pro and V4-Flash just dropped. 1.6T MoE, native 1M context, MIT weights, trained on Huawei Ascend. Priced at ~1/20th of Opus 4.7. The model Jensen warned about eight days ago on Dwarkesh — live on Hugging Face.

Breaking · Apr 24, 2026

Kimi K2.6 is here: the open model that refuses to clock out

Moonshot AI just shipped Kimi K2.6. 1T parameters, 262K context, 4,000 tool calls in a single run, and benchmarks that put it shoulder to shoulder with GPT-5.4 and Claude Opus 4.6.

Analysis · Apr 21, 2026

Meta is back: Muse Spark, the rebuild, and what the benchmarks actually say

Muse Spark scored 52 on the Intelligence Index. Llama 4 Maverick scored 18. Inside the $15 billion rebuild, Meta Superintelligence Labs, and what it means for the frontier.

Analysis · Apr 9, 2026

New AI Models April 2026: Anthropic Won't Ship Its Best. Open Source Will.

Claude Mythos is locked behind a 50-company firewall. GLM-5.1 beat GPT-5.4 on coding under MIT license. Gemma 4 went Apache 2.0. The full breakdown of AI's first week of April.

Weekly Briefing · Apr 8, 2026

New LLMs March 2026: GPT-5.4 Tied for #1. Nobody Talked About It.

GPT-5.4 matched Gemini 3.1 Pro within 0.01 points. NVIDIA unveiled trillion-parameter infrastructure. Anthropic clashed with the Pentagon. Nine text models shipped, seven open-weight. Full breakdown.

Monthly Roundup · Mar 24, 2026

Gemini 3.1 Pro Preview: what the .1 actually means

Google pushed a mid-cycle update to Gemini 3 Pro on February 11. We measured every benchmark delta: +4.2pp on SWE-Bench, +5.2pp on AIME 2025, #1 on LM Arena. Here is what changed and who should switch.

Analysis · Feb 20, 2026

The white-collar existential crisis: how AI killed the meaningless job

AI is not destroying jobs. It is exposing that most white-collar work was never truly meaningful. The K-shaped economy is splitting fast, agents are hiring humans, and the next 2-3 years decide which side you land on.

Opinion · Feb 16, 2026

Best Open Source LLMs February 2026

GLM-5 (Reasoning) takes #1 with a quality index (QI) of 49.64, Kimi K2.5 lands at #2, and MiniMax-M2.5 enters the rankings. Full rankings, Ollama picks by VRAM tier, and a self-hosting guide.

Rankings · Feb 9, 2026

DeepSeek models: what to use, what to skip, and where to run them

A data-driven DeepSeek hub that connects benchmarks to provider reality: pricing, speed, and time to first token across hosts.

Guide · Jan 15, 2026

MiniMax models: what to use, what to skip, and where to run them

A MiniMax hub for builders. Understand M1 vs M2 vs M2.1 and compare providers by price, speed, and time to first token.

Guide · Jan 15, 2026

Top 3 AI Models January 2026: Our Expert Picks

Claude Opus 4.5 for reasoning, GLM-4.7 for open source dominance, and Gemini 3 Pro for speed.

Opinion · Jan 9, 2026

The unspoken bottleneck reshaping artificial intelligence

The narrative around AI has long revolved around compute power. As we enter 2026, a quieter shift is underway. Memory, not chips, is becoming the constraint.

Deep Dive · Jan 8, 2026

January 2026: Open source vs proprietary LLMs compared

We compared 94 model endpoints. The gap between open source and proprietary has shrunk to 5-7 quality index points.

Analysis · Jan 2, 2026

2025 AI Year in Review: The Year Intelligence Became Infrastructure

Twelve months ago, we were debating whether AI could reason. Now we're debating who owns the reasoning.

Year in Review · Dec 26, 2025

The Open Source Revolution: How December 2025 Changed Everything

DeepSeek V3.2 hit 96% on AIME 2025. Xiaomi dropped a frontier model. GLM-4.7 claimed the coding crown.

Deep Dive · Dec 26, 2025

Three Forces That Broke OpenAI's Moat

OpenAI declared "Code Red" in December 2025. China's 15 open-weight models, the efficiency revolution, and custom silicon converged.

Industry · Dec 3, 2025

The State of LLMs: December 2025

Analysis of 114 models reveals a market in transition: benchmarks saturating, open-weight matching proprietary at 10x lower cost.

Analysis · Dec 3, 2025

Gemini 3 Pro vs GPT 5.1 vs Claude Opus 4.5

The AI arms race explodes in November 2025 with three frontier releases in 12 days. Benchmarks, pricing, and where each dominates.

Comparison · Nov 25, 2025

Kimi K2 Thinking vs ChatGPT 5.1: Reasoning Showdown

Moonshot AI's open-weight MoE takes on OpenAI's proprietary GPT-5.1. Architecture, benchmarks, pricing.

Comparison · Nov 18, 2025

Kimi K2 Thinking: How Open Weights Are Catching GPT-5

Moonshot AI's K2 Thinking lands a 67 on Artificial Analysis, sets agentic records, proves open weights compete on reasoning.

Analysis · Nov 2025

Open Source vs Proprietary LLMs: 2025 Benchmark Analysis

A 94-model deep dive covering quality, price, and speed deltas across the 2025 LLM landscape.

Analysis · Oct 2025

Top Open Source LLMs October 2025: Complete Guide

DeepSeek V3.1, Qwen3-235B, GLM-4.6 - performance benchmarks, pricing, and deployment insights.

Guide · Oct 2025

Why AI Outputs Are Turning Into Repetitive Slop

Lessons from Andrej Karpathy: model collapse, low-entropy outputs, and how to push past the slop.

Opinion · Oct 2025

GLM-4.5 vs Kimi-K2: Battle of the Agentic AI Giants

Z.ai's GLM-4.5 hybrid reasoning vs Moonshot AI's Kimi-K2 1T parameter architecture.

Comparison · Aug 12, 2025

GLM-4.5 vs Qwen3-235B: The Ultimate Comparison

Z.ai's GLM-4.5 vs Alibaba's Qwen3-235B with massive parameter count and FP8 optimization.

Comparison · Aug 12, 2025
