What is cost per task for AI models?

Cost per task is the average dollar cost to complete one Artificial Analysis Intelligence Index task. It is more useful for agents than token price alone because agent workflows compound prompt, cache, reasoning, and output token costs over many turns.

What changed in Artificial Analysis Intelligence Index v4.1?

Version 4.1 shifts the index toward agentic workloads, including GDPval-AA v2, τ³-Bench Banking, Terminal-Bench v2.1, SciCode, Humanity’s Last Exam, GPQA Diamond, CritPt, AA-Omniscience, and AA-LCR.

Should I always use the model with the highest Agentic Index?

No. The highest Agentic Index is a good starting point for high-risk, low-volume work. For repeated production workflows, compare Agentic Index with cost per task, response time, context needs, provider availability, and failure cost.

Which benchmarks matter most for AI agents?

Terminal-Bench 2.1, τ³-Bench Banking, GDPval-AA v2, SciCode, AA-LCR, and related reasoning evaluations are more useful for agents than generic chat benchmarks because they test tool use, terminal work, multi-step execution, and real work deliverables.

Cost per Task Is the New Agentic AI Model Benchmark

The most useful thing in Artificial Analysis' v4.1 update is not a single leaderboard change. It is a change in the unit of analysis. A token is not a task. A benchmark score is not a workflow. A fast first token is not a finished deliverable. Once models are asked to run tools, browse state, write files, recover from mistakes, and keep going for dozens or hundreds of turns, the old comparison format starts to leak.

Cost per task patches that leak. It takes the cost of running the Intelligence Index and normalizes it by the number of tasks across the evaluations. That immediately makes the model market look less like a neat ranking and more like a routing problem. You can buy more intelligence, but you do not always want to buy the most expensive path to task completion.
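As a rough illustration of that normalization, the sketch below pools the spend from every evaluation and divides by the pooled task count. The `EvalRun` type, the evaluation names, and the dollar figures are all hypothetical, and pooling (rather than averaging per-evaluation averages) is an assumption about how the number is built, not a statement of Artificial Analysis' exact method.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """One evaluation inside the index (names and figures are hypothetical)."""
    name: str
    total_cost_usd: float  # everything billed while running this evaluation
    num_tasks: int         # tasks attempted in this evaluation


def cost_per_task(runs: list[EvalRun]) -> float:
    """Total spend across all evaluations divided by the total task count."""
    total_tasks = sum(r.num_tasks for r in runs)
    if total_tasks == 0:
        raise ValueError("no tasks to normalize by")
    return sum(r.total_cost_usd for r in runs) / total_tasks


runs = [
    EvalRun("eval_a", total_cost_usd=120.0, num_tasks=100),
    EvalRun("eval_b", total_cost_usd=30.0, num_tasks=200),
]
print(round(cost_per_task(runs), 2))  # 0.5
```

Note that the pooled figure ($0.50) differs from the mean of the two per-evaluation averages ($1.20 and $0.15), so the choice of normalization changes the headline number.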

Live WhatLLM view

Agentic capability vs. task cost

Each point is a model with Artificial Analysis v4.1 Agentic Index and cost-per-task data in the local snapshot. The best production choices are usually not at the far top or far left; they sit on the frontier between both.

Legend: Proprietary, Open weights. Lower-right is expensive. Upper-left is rare.

What v4.1 is really saying

Artificial Analysis v4.1 moves the Intelligence Index toward agentic work. The current index includes GDPval-AA v2, τ³-Bench Banking, Terminal-Bench v2.1, SciCode, Humanity's Last Exam, GPQA Diamond, CritPt, AA-Omniscience, and AA-LCR. That mix matters. It gives more weight to models that can operate in messy work environments rather than only solve isolated questions.

The upgrade also changes how cost should be discussed. Provider price sheets still matter, but they are the wrong first abstraction for agents. Agents spend money through repeated context, tool traces, retries, reasoning tokens, answer tokens, cache writes, and cache hits. A price per million tokens can look cheap while the full task remains expensive. A model can also look expensive per token while being worth it if it finishes in fewer attempts.
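A small worked example makes the last two sentences concrete. Everything here is a made-up illustration: the prices, token counts, and success rates are hypothetical, output pricing is assumed to cover reasoning and answer tokens alike, and failed attempts are assumed to be retried independently until one succeeds.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """USD per million tokens; all figures below are invented for illustration."""
    input: float
    output: float  # assumed to cover reasoning and answer tokens alike


def attempt_cost(p: Prices, input_tokens: int, output_tokens: int) -> float:
    """Cost of one full attempt at a task."""
    return (input_tokens * p.input + output_tokens * p.output) / 1_000_000


def cost_per_completed_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per success when failures are retried independently."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Cheap per token, but verbose and unreliable.
cheap = attempt_cost(Prices(input=0.5, output=2.0), 400_000, 50_000)
# Expensive per token, but terse and reliable.
strong = attempt_cost(Prices(input=3.0, output=15.0), 150_000, 20_000)

print(round(cost_per_completed_task(cheap, 0.3), 2))   # 1.0
print(round(cost_per_completed_task(strong, 0.9), 2))  # 0.83
```

Under these assumed numbers the model with the six-times-cheaper input price ends up costing more per finished task.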

Capability: Agentic Index

A better first filter for tools, terminal tasks, multi-step work, and autonomous execution than a general chat score.

Economics: Cost per task

The cleaner unit for production routing: one completed benchmark task, not one million abstract tokens.

Operations: Cache-aware pricing

Repeated context is normal in agents. Cached input economics can quietly decide whether a design scales.

The live leaderboard already tells a different story

The top of the agentic curve is still dominated by frontier proprietary models. That is expected. The interesting part is the shape underneath: open-weight and lower-cost models have become credible enough that the right answer often depends on failure cost, volume, and latency instead of raw rank alone.

Top agentic models with cost-per-task data

Sorted by Agentic Index from the current WhatLLM Artificial Analysis snapshot.

Model	Provider	License	Agentic Index	Intelligence Index	Cost / task	Terminal-Bench	τ-Bench	Response time
Claude Opus 5 (Adaptive Reasoning, Max Effort)	Anthropic	Proprietary	55.3	60.7	$2.03	N/A	N/A	1.5 min
Claude Opus 5 (Adaptive Reasoning, Xhigh Effort)	Anthropic	Proprietary	54.5	60.1	$1.80	N/A	N/A	34.3 sec
GPT-5.6 Sol (max)	OpenAI	Proprietary	54.0	58.9	$1.54	66%	85%	2.6 min
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Anthropic	Proprietary	52.8	59.9	$2.75	63%	99%	1.6 min
Claude Opus 5 (Adaptive Reasoning, High Effort)	Anthropic	Proprietary	52.1	58.9	$1.23	N/A	N/A	24.7 sec
GPT-5.6 Sol (xhigh)	OpenAI	Proprietary	51.8	57.7	$0.944	61%	85%	53.2 sec
Kimi K3 (max)	Kimi	Open	50.1	57.1	$0.723	N/A	N/A	1.3 min
GPT-5.6 Sol (high)	OpenAI	Proprietary	48.5	55.9	$0.771	62%	83%	21.7 sec
GPT-5.6 Terra (max)	OpenAI	Proprietary	47.4	55.0	$0.624	58%	86%	2.5 min
Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	Anthropic	Proprietary	47.2	55.7	$2.03	58%	94%	18.9 sec

Read the table horizontally. Claude Opus 5 (Adaptive Reasoning, Max Effort) sits near the top of the capability stack at $2.03 per task. Kimi K3 (max) costs $0.723 per task, roughly a third as much, while still posting a 50.1 Agentic Index. That gap is not a trivia point. It is the difference between an agent that can be turned loose on every support ticket and a model that should be reserved for escalations.
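One way to read the table as a trade-off rather than a ranking is to keep only the rows that no other row beats on both cost and Agentic Index. The sketch below does that with a simple sweep. The scores and costs are copied from the table; the shortened model labels and the `pareto_frontier` helper are illustrative choices, not part of WhatLLM or Artificial Analysis.

```python
# (model, Agentic Index, cost per task in USD), copied from the table above.
# Model labels are shortened for readability.
MODELS = [
    ("Claude Opus 5 (Max)", 55.3, 2.03),
    ("Claude Opus 5 (Xhigh)", 54.5, 1.80),
    ("GPT-5.6 Sol (max)", 54.0, 1.54),
    ("Claude Fable 5 (Max)", 52.8, 2.75),
    ("Claude Opus 5 (High)", 52.1, 1.23),
    ("GPT-5.6 Sol (xhigh)", 51.8, 0.944),
    ("Kimi K3 (max)", 50.1, 0.723),
    ("GPT-5.6 Sol (high)", 48.5, 0.771),
    ("GPT-5.6 Terra (max)", 47.4, 0.624),
    ("Claude Opus 4.8 (Max)", 47.2, 2.03),
]


def pareto_frontier(models):
    """Keep models that no other model beats on both cost and Agentic Index."""
    frontier = []
    best_agentic = float("-inf")
    # Cheapest first; on a cost tie, the stronger model is visited first.
    for name, agentic, cost in sorted(models, key=lambda m: (m[2], -m[1])):
        if agentic > best_agentic:
            frontier.append((name, agentic, cost))
            best_agentic = agentic
    return frontier


for name, agentic, cost in pareto_frontier(MODELS):
    print(f"{name}: {agentic} at ${cost}/task")
```

On this snapshot the sweep drops three rows: Claude Fable 5, GPT-5.6 Sol (high), and Claude Opus 4.8 each have a cheaper alternative with a higher Agentic Index. It only looks at two columns, so latency and the per-benchmark scores still need a separate check.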

The practical decision tree

The wrong lesson is "use the cheapest capable model." Cheap failure is still failure. The better lesson is that agentic routing needs tiers. A good system should know when to pay for maximum capability, when to route to a value model, and when to stop using a language model at all because the workflow should be a deterministic API call.

Use the frontier model

High failure cost, ambiguous objective, expensive human review, or a task that can alter real customer or business state.

Claude Opus 5: $2.03 / task

Use the value frontier

High volume, repeatable internal work where a strong model can finish most tasks and escalate uncertainty.

Kimi K3 (max): $0.723 / task

Use open weights

Data locality, cost control, fine-tuning, auditability, or deployment independence matters more than absolute rank.

Kimi K3 (max): $0.723 / task

Optimize for speed

The agent is interactive and users are waiting through multiple steps, retries, or tool calls.

Claude Opus 5: $2.03 / task
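The four cards above, plus the "stop using a language model" case, can be sketched as a tiny router. This is a minimal sketch under stated assumptions: the `Task` fields and `Route` names are hypothetical, and the priority order (hard constraints such as data locality before failure cost, then interactivity) is one possible policy rather than a recommendation from the data.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DETERMINISTIC = "deterministic API call"
    OPEN_WEIGHTS = "open-weight model"
    FRONTIER = "frontier model"
    FAST = "low-latency model"
    VALUE = "value model"


@dataclass
class Task:
    """Hypothetical task descriptor; a real system would derive these signals."""
    has_deterministic_path: bool = False
    needs_data_locality: bool = False
    high_failure_cost: bool = False
    interactive: bool = False


def route(task: Task) -> Route:
    """Check conditions in an assumed priority order, hard constraints first."""
    if task.has_deterministic_path:
        return Route.DETERMINISTIC  # no language model needed at all
    if task.needs_data_locality:
        return Route.OPEN_WEIGHTS   # treated here as a hard constraint
    if task.high_failure_cost:
        return Route.FRONTIER
    if task.interactive:
        return Route.FAST
    return Route.VALUE              # default for high-volume, repeatable work


print(route(Task(high_failure_cost=True)).value)  # frontier model
print(route(Task()).value)                        # value model
```

The ordering is the real design decision. A team that treats failure cost as more important than data locality would swap the second and third checks.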

Why cached input tokens matter more for agents than chat

Prompt caching can be a small optimization in single-turn chat. In agents, it can be structural. Agents tend to carry the same policy, tool schemas, customer context, repo map, memory summary, or scratchpad across repeated calls. If the provider gives a real cached-input discount, the same architecture can become materially cheaper without changing model quality.

This is also why "one blended price" can hide a lot. Two models with similar visible output prices may behave very differently once cache reads, cache writes, reasoning tokens, and answer tokens are counted separately. v4.1's cache-aware reporting is a nudge toward the way production systems are actually billed.
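To see how much a cached-input discount can move the bill, consider an agent that resends the same prefix (policy, tool schemas, repo map) on every turn. The sketch below is deliberately simplified: the three prices are hypothetical, the prefix is assumed to be written to cache once and read on every later turn, and growing conversation history and cache expiry are ignored.

```python
def input_cost_usd(
    turns: int,
    prefix_tokens: int,
    fresh_tokens_per_turn: int,
    input_price: float,        # USD per million uncached input tokens
    cached_read_price: float,  # USD per million cached input tokens
    cache_write_price: float,  # USD per million tokens written to cache
    use_cache: bool,
) -> float:
    """Input-side cost of an agent run that resends the same prefix every turn."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    fresh = turns * fresh_tokens_per_turn * input_price
    if use_cache:
        # First turn writes the prefix to cache; later turns read it back.
        prefix = prefix_tokens * cache_write_price
        prefix += (turns - 1) * prefix_tokens * cached_read_price
    else:
        prefix = turns * prefix_tokens * input_price
    return (fresh + prefix) / 1_000_000


# Hypothetical prices and sizes, chosen only to show the shape of the effect.
common = dict(turns=50, prefix_tokens=20_000, fresh_tokens_per_turn=1_000,
              input_price=3.0, cached_read_price=0.3, cache_write_price=3.75)

print(round(input_cost_usd(**common, use_cache=False), 3))  # 3.15
print(round(input_cost_usd(**common, use_cache=True), 3))   # 0.519
```

With these assumed numbers the same 50-turn run costs about one sixth as much on the input side, with no change in model or prompt.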

The benchmark shift is healthy

Removing saturated evaluations is not a cosmetic choice. When a benchmark stops separating frontier systems, it starts rewarding the wrong thing: confidence in stale signal. Replacing older tasks with Terminal-Bench 2.1, τ³-Bench Banking, and GDPval-AA v2 makes the index harder to game and closer to the work people are now asking models to do.

GDPval-AA v2 is especially important because it is not just "answer this question." It asks for work products. Artificial Analysis describes GDPval-AA v2 as an agentic evaluation for economically valuable tasks, with the Elo scale anchored to human expert performance at 1000 and a higher turn limit for longer trajectories. That is a different world from a multiple-choice benchmark.

What to do next

If you are building an AI agent, stop choosing models from a single leaderboard column. Start with the Agentic Index. Filter by task cost. Check context and latency. Then run a small evaluation on your own workflows with the same escalation policy you plan to ship. The right model is not the one that wins the most screenshots. It is the one that lets your system finish more work per dollar without making the expensive mistakes.
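For the last step, "work per dollar" on your own workflows can be measured with very little code. The sketch below assumes you have logged one record per task in the form `(model, succeeded, cost_usd)`; that record shape, the model names, and the sample numbers are hypothetical.

```python
from collections import defaultdict


def work_per_dollar(results):
    """Completed tasks per dollar, per model.

    results: iterable of (model, succeeded, cost_usd) from your own eval run.
    """
    done = defaultdict(int)
    spent = defaultdict(float)
    for model, succeeded, cost_usd in results:
        done[model] += int(succeeded)
        spent[model] += cost_usd
    # Skip models with zero recorded spend to avoid dividing by zero.
    return {m: done[m] / spent[m] for m in spent if spent[m] > 0}


# Hypothetical eval log: model_a is pricier but always finishes,
# model_b is cheaper but fails half the time.
results = [
    ("model_a", True, 2.0), ("model_a", True, 2.0), ("model_a", True, 2.0),
    ("model_b", True, 0.5), ("model_b", False, 0.5),
    ("model_b", True, 0.5), ("model_b", False, 0.5),
]
print(work_per_dollar(results))  # {'model_a': 0.5, 'model_b': 1.0}
```

This ratio says nothing about how costly each failure is, so it should be read alongside the failure cost of the workflow rather than on its own.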

Try the routing view

The Agentic Fit Finder turns this article into a practical model-selection workflow: choose task type, autonomy level, context need, and cost/speed preference, then get a ranked shortlist from the same dataset.


FAQ

What is cost per task?

It is the average cost to complete one Artificial Analysis Intelligence Index task. It is useful because it converts model selection from abstract token pricing into a unit closer to what builders actually buy: completed work.

Does cost per task replace benchmarks?

No. It sits next to benchmarks. Capability tells you whether a model is likely to solve the task. Cost per task tells you what that capability costs when exercised in benchmark-like workloads.

Why not always use open weights?

Open-weight models are becoming excellent value choices, especially for controlled or high-volume workloads. Frontier proprietary models still matter when the failure cost is high, when the task is novel, or when the last few points of reliability are worth more than the cost gap.

Where does WhatLLM get this data?

WhatLLM uses the Artificial Analysis data APIs, including the v2 language models endpoint for Intelligence Index v4.1 headline metrics and cost-per-task data. The local snapshot is refreshed by the `/api/cron/warm-aa` path and reused across rankings, compare pages, and tools.