The most useful thing in Artificial Analysis' v4.1 update is not a single leaderboard change. It is a change in the unit of analysis. A token is not a task. A benchmark score is not a workflow. A fast first token is not a finished deliverable. Once models are asked to run tools, browse state, write files, recover from mistakes, and keep going for dozens or hundreds of turns, the old comparison format starts to leak.
Cost per task patches that leak. It takes the cost of running the Intelligence Index and normalizes it by the number of tasks across the evaluations. That immediately makes the model market look less like a neat ranking and more like a routing problem. You can buy more intelligence, but you do not always want to buy the most expensive path to task completion.
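As a minimal sketch of that normalization, the arithmetic is a single division. The function name and the dollar and task figures below are hypothetical and are not taken from Artificial Analysis.

```python
def cost_per_task(total_run_cost_usd: float, task_count: int) -> float:
    """Average cost of one task: total evaluation spend divided by tasks run."""
    if task_count <= 0:
        raise ValueError("task_count must be positive")
    return total_run_cost_usd / task_count


# Hypothetical run: $250 spent across 1,000 benchmark tasks.
example = cost_per_task(250.0, 1000)
print(f"${example:.2f} per task")
```

The point of the unit is that everything a run actually consumed ends up in the numerator, whatever the per-token price sheet says.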
Live WhatLLM view
Agentic capability vs. task cost
Each point is a model with Artificial Analysis v4.1 Agentic Index and cost-per-task data in the local snapshot. The best production choices are usually not at the far top or far left; they sit on the frontier between both.
What v4.1 is really saying
Artificial Analysis v4.1 moves the Intelligence Index toward agentic work. The current index includes GDPval-AA v2, τ³-Bench Banking, Terminal-Bench v2.1, SciCode, Humanity's Last Exam, GPQA Diamond, CritPt, AA-Omniscience, and AA-LCR. That mix matters. It gives more weight to models that can operate in messy work environments rather than only solve isolated questions.
The upgrade also changes how cost should be discussed. Provider price sheets still matter, but they are the wrong first abstraction for agents. Agents spend money through repeated context, tool traces, retries, reasoning tokens, answer tokens, cache writes, and cache hits. A price per million tokens can look cheap while the full task remains expensive. A model can also look expensive per token while being worth it if it finishes in fewer attempts.
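The "fewer attempts" point can be made concrete with a rough sketch. It assumes each attempt is an independent retry with a fixed success rate, so the expected number of attempts is 1 / success rate. The class, the function, and every number here are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    cost_usd: float      # full cost of one attempt, all token types included
    success_rate: float  # probability that a single attempt finishes the task


def expected_cost_per_completion(attempt: Attempt) -> float:
    """Expected spend until the first success, assuming independent retries."""
    if not 0 < attempt.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return attempt.cost_usd / attempt.success_rate


# Hypothetical models: one cheap per attempt but unreliable, one pricier but reliable.
cheap = Attempt(cost_usd=0.10, success_rate=0.2)
pricey = Attempt(cost_usd=0.30, success_rate=0.9)

print(expected_cost_per_completion(cheap))   # about 0.50
print(expected_cost_per_completion(pricey))  # about 0.33
```

Under these made-up numbers the model that costs three times more per attempt is the cheaper one per finished task.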
Capability
Agentic Index
A better first filter for tools, terminal tasks, multi-step work, and autonomous execution than a general chat score.
Economics
Cost per task
The cleaner unit for production routing: one completed benchmark task, not one million abstract tokens.
Operations
Cache-aware pricing
Repeated context is normal in agents. Cached input economics can quietly decide whether a design scales.
The live leaderboard already tells a different story
The top of the agentic curve is still dominated by frontier proprietary models. That is expected. The interesting part is the shape underneath: open-weight and lower-cost models have become credible enough that the right answer often depends on failure cost, volume, and latency instead of raw rank alone.
Top agentic models with cost-per-task data
Sorted by Agentic Index from the current WhatLLM Artificial Analysis snapshot.
| Model | Provider | License | Agentic | Intelligence | Cost / task | Terminal | τ-Bench | Response |
|---|---|---|---|---|---|---|---|---|
| Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Anthropic | Proprietary | 80.6 | 59.9 | $3.25 | 63% | 99% | N/A |
| Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | Anthropic | Proprietary | 77.8 | 55.7 | $1.78 | 58% | 94% | 43.4 sec |
| GPT-5.5 (xhigh) | OpenAI | Proprietary | 74.1 | 54.8 | $0.993 | 61% | 94% | 2.1 min |
| GPT-5.5 (high) | OpenAI | Proprietary | 72.0 | 53.1 | $0.668 | 60% | 93% | 1.0 min |
| Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | Proprietary | 71.3 | 53.5 | $1.97 | 52% | 89% | 28.7 sec |
| Gemini 3.5 Flash (high) | Google | Proprietary | 70.3 | 50.2 | $0.614 | 41% | 95% | 22.8 sec |
| MiniMax-M3 | MiniMax | Open | 68.6 | 44.4 | $0.182 | 42% | 89% | 45.3 sec |
| GPT-5.4 (xhigh) | OpenAI | Proprietary | 68.0 | 51.4 | $1.03 | 58% | 87% | 2.0 min |
| MiMo-V2.5-Pro | Xiaomi | Open | 67.4 | 42.2 | $0.062 | 43% | 94% | 1.1 min |
| DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | Open | 67.2 | 44.3 | $0.056 | 46% | 96% | 1.1 min |
Read the table across each row, not down a single column. Claude Opus 4.8 (Adaptive Reasoning, Max Effort) sits near the top of the capability stack at $1.78 per task. DeepSeek V4 Pro (Reasoning, Max Effort) is much cheaper at $0.056 while still posting a 67.2 Agentic Index. That gap is not a trivia point. It is the difference between an agent that can be turned loose on every support ticket and a model that should be reserved for escalations.
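One way to read the table across both dimensions at once is to compute which rows are not dominated, meaning no other row is both at least as capable and at least as cheap. The sketch below uses the Agentic Index and cost-per-task values from the table above, with model names shortened. The function name is illustrative.

```python
# (name, agentic_index, cost_per_task_usd), copied from the table above
MODELS = [
    ("Claude Fable 5", 80.6, 3.25),
    ("Claude Opus 4.8", 77.8, 1.78),
    ("GPT-5.5 (xhigh)", 74.1, 0.993),
    ("GPT-5.5 (high)", 72.0, 0.668),
    ("Claude Opus 4.7", 71.3, 1.97),
    ("Gemini 3.5 Flash (high)", 70.3, 0.614),
    ("MiniMax-M3", 68.6, 0.182),
    ("GPT-5.4 (xhigh)", 68.0, 1.03),
    ("MiMo-V2.5-Pro", 67.4, 0.062),
    ("DeepSeek V4 Pro", 67.2, 0.056),
]


def pareto_frontier(models):
    """Return names of models that no cheaper-or-equal model beats on Agentic Index."""
    frontier = []
    best_agentic = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, the stronger model goes first.
    for name, agentic, cost in sorted(models, key=lambda m: (m[2], -m[1])):
        if agentic > best_agentic:
            frontier.append(name)
            best_agentic = agentic
    return frontier


frontier = pareto_frontier(MODELS)
dominated = [name for name, _, _ in MODELS if name not in frontier]
print("frontier:", frontier)
print("dominated:", dominated)
```

Within these ten rows, only GPT-5.4 (xhigh) and Claude Opus 4.7 are dominated on these two columns. Everything else is a defensible point on the cost versus capability trade-off, which is why the choice becomes a routing question.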
The practical decision tree
The wrong lesson is "use the cheapest capable model." Cheap failure is still failure. The better lesson is that agentic routing needs tiers. A good system should know when to pay for maximum capability, when to route to a value model, and when to stop using a language model at all because the workflow should be a deterministic API call.
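A tiered router along these lines could be sketched as follows. The task fields, the tier names, and the order of precedence are all assumptions made for illustration. A real system would derive them from its own failure costs and escalation policy.

```python
from dataclasses import dataclass


@dataclass
class Task:
    deterministic: bool = False       # a fixed API call would do the job
    high_failure_cost: bool = False   # mistakes touch customer or business state
    needs_self_hosting: bool = False  # data locality or deployment independence
    interactive: bool = False         # a user is waiting on each step


def route(task: Task) -> str:
    """Pick a tier; the precedence order here is an assumption, not a rule."""
    if task.deterministic:
        return "api_call"      # no language model needed
    if task.high_failure_cost:
        return "frontier"      # pay for maximum capability
    if task.needs_self_hosting:
        return "open_weights"
    if task.interactive:
        return "fast"
    return "value"             # default for high-volume, repeatable work


print(route(Task(high_failure_cost=True)))  # frontier
print(route(Task()))                        # value
```

Checking for the deterministic case first reflects the point above: the cheapest model call is the one that never happens.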
Use the frontier model
High failure cost, ambiguous objective, expensive human review, or a task that can alter real customer or business state.
Claude Opus 4.8 · $1.78 / task
Use the value frontier
High volume, repeatable internal work where a strong model can finish most tasks and escalate uncertainty.
DeepSeek V4 Pro · $0.056 / task
Use open weights
Data locality, cost control, fine-tuning, auditability, or deployment independence matters more than absolute rank.
MiniMax-M3 · $0.182 / task
Optimize for speed
The agent is interactive and users are waiting through multiple steps, retries, or tool calls.
Grok 4.3 · $0.145 / task
Why cached input tokens matter more for agents than chat
Prompt caching can be a small optimization in single-turn chat. In agents, it can be structural. Agents tend to carry the same policy, tool schemas, customer context, repo map, memory summary, or scratchpad across repeated calls. If the provider gives a real cached-input discount, the same architecture can become materially cheaper without changing model quality.
This is also why "one blended price" can hide a lot. Two models with similar visible output prices may behave very differently once cache reads, cache writes, reasoning tokens, and answer tokens are counted separately. v4.1's cache-aware reporting is a nudge toward the way production systems are actually billed.
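To see how much a cached-input discount can matter, here is a rough sketch that bills a repeated prefix at a cached rate after the first call. All prices and token counts are hypothetical, and the model is simplified: it ignores any cache-write surcharge, cache expiry, and separately billed reasoning tokens.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """USD per million tokens. All values used below are hypothetical."""
    input: float
    cached_input: float
    output: float


def run_cost(prices: Prices, prefix_tokens: int, fresh_tokens: int,
             output_tokens: int, calls: int, use_cache: bool) -> float:
    """Total cost of an agent run that resends the same prefix on every call."""
    per_million = 1_000_000
    total = 0.0
    for i in range(calls):
        # The first call always pays full price for the prefix.
        cached = use_cache and i > 0
        prefix_price = prices.cached_input if cached else prices.input
        total += prefix_tokens * prefix_price / per_million
        total += fresh_tokens * prices.input / per_million
        total += output_tokens * prices.output / per_million
    return total


prices = Prices(input=3.0, cached_input=0.3, output=15.0)
uncached = run_cost(prices, 20_000, 1_000, 500, calls=30, use_cache=False)
cached = run_cost(prices, 20_000, 1_000, 500, calls=30, use_cache=True)
print(f"without cache: ${uncached:.3f}")  # about $2.115
print(f"with cache:    ${cached:.3f}")    # about $0.549
```

With these assumed numbers the same 30-call trajectory costs roughly a quarter as much once the shared prefix is cached, with no change to the model or the prompt.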
The benchmark shift is healthy
Removing saturated evaluations is not a cosmetic choice. When a benchmark stops separating frontier systems, it starts rewarding the wrong thing: confidence in a stale signal. Replacing older tasks with Terminal-Bench v2.1, τ³-Bench Banking, and GDPval-AA v2 makes the index harder to game and closer to the work people are now asking models to do.
GDPval-AA v2 is especially important because it is not just "answer this question." It asks for work products. Artificial Analysis describes GDPval-AA v2 as an agentic evaluation for economically valuable tasks, with the Elo scale anchored to human expert performance at 1000 and a higher turn limit for longer trajectories. That is a different world from a multiple-choice benchmark.
What to do next
If you are building an AI agent, stop choosing models from a single leaderboard column. Start with the Agentic Index. Filter by task cost. Check context and latency. Then run a small evaluation on your own workflows with the same escalation policy you plan to ship. The right model is not the one that wins the most screenshots. It is the one that lets your system finish more work per dollar without making the expensive mistakes.
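The first three steps of that checklist can be sketched as a simple filter before any evaluation on your own workflows. The rows below are a subset of the table earlier in this article, with response times converted to seconds. The thresholds are hypothetical and should come from your own budget and latency targets.

```python
# (name, agentic_index, cost_per_task_usd, response_seconds), from the table above
CANDIDATES = [
    ("Claude Opus 4.8", 77.8, 1.78, 43.4),
    ("GPT-5.5 (high)", 72.0, 0.668, 60.0),
    ("Gemini 3.5 Flash (high)", 70.3, 0.614, 22.8),
    ("MiniMax-M3", 68.6, 0.182, 45.3),
    ("DeepSeek V4 Pro", 67.2, 0.056, 66.0),
]


def shortlist(models, min_agentic, max_cost_usd, max_response_s):
    """Keep models that clear all three thresholds, strongest Agentic Index first."""
    keep = [
        m for m in models
        if m[1] >= min_agentic and m[2] <= max_cost_usd and m[3] <= max_response_s
    ]
    return [m[0] for m in sorted(keep, key=lambda m: m[1], reverse=True)]


# Hypothetical requirements for an interactive internal agent.
picks = shortlist(CANDIDATES, min_agentic=68.0, max_cost_usd=0.70, max_response_s=50.0)
print(picks)  # ['Gemini 3.5 Flash (high)', 'MiniMax-M3']
```

A shortlist like this only narrows the field. The evaluation on your own tasks, run with your real escalation policy, is what decides between the survivors.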
Try the routing view
The Agentic Fit Finder turns this article into a practical model-selection workflow: choose task type, autonomy level, context need, and cost/speed preference, then get a ranked shortlist from the same dataset.
FAQ
What is cost per task?
It is the average cost to complete one Artificial Analysis Intelligence Index task. It is useful because it converts model selection from abstract token pricing into a unit closer to what builders actually buy: completed work.
Does cost per task replace benchmarks?
No. It sits next to benchmarks. Capability tells you whether a model is likely to solve the task. Cost per task tells you what that capability costs when exercised in benchmark-like workloads.
Why not always use open weights?
Open-weight models are becoming excellent value choices, especially for controlled or high-volume workloads. Frontier proprietary models still matter when the failure cost is high, when the task is novel, or when the last few points of reliability are worth more than the cost gap.
Where does WhatLLM get this data?
WhatLLM uses the Artificial Analysis data APIs, including the v2 language models endpoint for Intelligence Index v4.1 headline metrics and cost-per-task data. The local snapshot is refreshed by the `/api/cron/warm-aa` path and reused across rankings, compare pages, and tools.