Kimi K2.6 is here: the open model that refuses to clock out

By Dylan Bristot · 11 min read

TL;DR

  • Moonshot AI shipped Kimi K2.6 on April 20, a 1T parameter MoE with 32B active, 262K context, and native vision through MoonViT.
  • It is built to run 12+ hour sessions with 4,000+ tool calls and to coordinate swarms of up to 300 sub-agents. This is not a better chatbot. It is an engineer that does not log off.
  • Benchmarks land at or above GPT-5.4 on HLE-Full with tools (54.0), BrowseComp (83.2), SWE-Bench Pro (58.6), GPQA-Diamond (90.5), and AIME 2026 (96.4), and ahead of Claude Opus 4.6 on everything but the SWE-Bench pair and MMMU-Pro.
  • Cloudflare Workers AI lists it at $0.95 per million input, $4 per million output. Claude Opus 4.6 is roughly 15x that on heavy workloads.
  • Open weights on Hugging Face under a modified MIT license. vLLM and SGLang work out of the box. No waitlist, no gated firewall, no 50-company preview.

Every few months somebody in the open source camp ships a model that forces a rewrite of the rankings. Kimi K2.6 is that model for April 2026. It is the first open weights release where the interesting benchmark is not how smart the model is but how long it stays on task, and the answer is measured in hours.

What Moonshot actually shipped

The architecture is the same skeleton as K2 Thinking with a heavier pass of reinforcement learning and a redesigned vision tower. The spec sheet reads like last year's frontier:

  • 1 trillion total parameters, 32 billion active per token. Mixture of Experts with 384 experts, 8 routed plus 1 shared.
  • 61 layers, 64 MLA attention heads, 7168 hidden dim, SwiGLU, 160K vocabulary.
  • 262,144 token context window. Up to 262K output tokens in some modes, with reasoning budgets that can chew through 98K tokens before the final answer.
  • MoonViT vision encoder at 400M parameters, native in the model rather than bolted on as an adapter.
  • Modified MIT license. Weights on Hugging Face as moonshotai/Kimi-K2.6, attribution required on large deployments.

Structured outputs, JSON schema, parallel tool calls, and an interleaved thinking mode where the model can pause mid-chain to call a tool and resume reasoning are all native. No stitching required.
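These features follow the familiar chat-completions shape. Below is a minimal sketch of a request that combines a JSON-schema-constrained output with a tool the model may call mid-reasoning; the payload follows OpenAI chat-completions conventions, and the exact field names Moonshot accepts (and the `run_tests` tool) are assumptions for illustration, not taken from its docs.

```python
# Sketch of a structured-output request with a parallel-callable tool.
# Field names follow OpenAI chat-completions conventions; the specific
# keys Moonshot honors are assumptions, so check the official API docs.
import json

def build_request(question: str) -> dict:
    return {
        "model": "kimi-k2.6",
        "messages": [{"role": "user", "content": question}],
        # JSON-schema constrained output
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "answer",
                "schema": {
                    "type": "object",
                    "properties": {"answer": {"type": "string"}},
                    "required": ["answer"],
                },
            },
        },
        # A tool the model can call mid-chain (interleaved thinking)
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_tests",  # hypothetical tool for illustration
                "description": "Run the repo's test suite and return results",
                "parameters": {"type": "object", "properties": {}},
            },
        }],
        "parallel_tool_calls": True,
    }

req = build_request("Fix the failing matcher test.")
print(json.dumps(req, indent=2)[:60])
```

The point is that none of this needs an orchestration shim: schema, tools, and parallel calls ride in one request body.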

The real story is duration, not intelligence

Every major lab now posts a reasoning score in the high 80s on GPQA. The interesting frontier in 2026 is not another point on a multiple choice exam. It is whether the model can still be useful at hour 9 of a coding session when the context is full of tool output, half-failed tests, and a plan it wrote to itself at hour 2.

Moonshot is leaning into that. K2.6 is marketed around 4,000 tool calls over a 12-hour run, Claw Groups for human-plus-agent teams, and Agent Swarms that spin up 300 sub-agents on a shared objective. The demo doing the rounds is a rewrite of an 8-year-old financial matching engine: over 1,000 edits, a full test harness, and throughput gains the team had written off as physically impossible.

You can argue about the demo. You cannot argue about the benchmark category that used to be a proprietary moat. Long-horizon autonomy is now an open weights capability.

Benchmarks that matter

All numbers from Moonshot's release card, verified against Artificial Analysis runs where available. Thinking mode on, temperature 1.0, top-p 1.0.

| Benchmark | Kimi K2.6 | GPT-5.4 | Claude Opus 4.6 | Note |
|---|---|---|---|---|
| HLE-Full (with tools) | 54.0 | 52.1 | 53.0 | Open weights SOTA |
| BrowseComp | 83.2 (swarm 86.3) | 81.4 | 80.7 | Web research and synthesis |
| SWE-Bench Pro | 58.6 | 57.9 | 60.1 | Hardest real world repo benchmark |
| SWE-Bench Verified | 80.2 | 79.5 | 82.4 | Closed gap with Claude |
| Terminal-Bench 2.0 | 66.7 | 61.3 | 64.9 | Long-horizon shell workflows |
| LiveCodeBench v6 | 89.6 | 88.1 | 87.3 | Competitive coding |
| GPQA-Diamond | 90.5 | 89.8 | 90.2 | Graduate science reasoning |
| AIME 2026 | 96.4 | 95.1 | 94.7 | Competition math |
| MathVision (Python) | 93.2 | 92.0 | 91.4 | Vision plus reasoning |
| MMMU-Pro | 79.4 | 81.2 | 80.1 | One of the few losses |

On Artificial Analysis, K2.6 lands at #4 overall on the Intelligence Index, behind GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6, and #1 among open weights. On the Agentic Index it is either first or tied for first depending on how you weight BrowseComp against τ²-Bench.

Intelligence Index: live leaderboard from Artificial Analysis, where K2.6 debuted at #4 overall and #1 among open weights.

Pricing that actually breaks the proprietary pitch

The Moonshot API lists K2.6 as kimi-k2.6. Cloudflare Workers AI exposes it at @cf/moonshotai/kimi-k2.6 for $0.95 per million input tokens and $4 per million output. Fireworks, Baseten, and Ollama are already live. A full Humanity's Last Exam sweep on the base endpoint comes in somewhere around $450 including reasoning overhead.

For comparison, running the same suite through Claude Opus 4.6 at $15 in and $75 out lands north of $6,000. GPT-5.4 sits in the middle. For most agentic workloads the quality gap does not justify the spend. That is the quiet shift that has been underway since Kimi K2 Thinking shipped in November, and it is now obvious.
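The arithmetic behind that gap is worth doing explicitly. A back-of-the-envelope comparison using the per-million-token prices quoted above (the 50M/10M token mix is an illustrative assumption; real bills depend on your input/output split and reasoning overhead):

```python
# Cost comparison from the article's quoted prices:
# K2.6 at $0.95 in / $4 out, Opus 4.6 at $15 in / $75 out (per million).
def run_cost(in_tokens: float, out_tokens: float,
             in_price: float, out_price: float) -> float:
    """Dollar cost for prices quoted per million tokens."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# A token-heavy agentic day: 50M input, 10M output (assumed workload).
k26  = run_cost(50e6, 10e6, 0.95, 4.00)   # Kimi K2.6 on Workers AI
opus = run_cost(50e6, 10e6, 15.00, 75.00) # Claude Opus 4.6 list price

print(f"K2.6: ${k26:,.2f}  Opus: ${opus:,.2f}  ratio: {opus / k26:.1f}x")
# K2.6: $87.50  Opus: $1,500.00  ratio: 17.1x
```

Shift the mix toward output-heavy reasoning and the ratio climbs further, since the output price gap (75 vs 4) is wider than the input gap.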

Run it yourself

Weights are about 594 GB at native precision. 8x H100 or 16x A100 will serve it at production throughput on vLLM. If you do not have that sitting around, Fireworks and Baseten will spin up a dedicated endpoint in minutes.
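For the 8x H100 case, a serving command might look like the sketch below. Flag names follow vLLM's standard CLI; since MoE and long-context options shift between vLLM releases, treat this as a starting point and check your version's docs rather than copying it verbatim.

```shell
# Serve the open weights with vLLM on an 8x H100 node (sketch).
vllm serve moonshotai/Kimi-K2.6 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --trust-remote-code
```

This exposes an OpenAI-compatible endpoint on the default port, which matters for the next section.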

Or call the API

kimi-k2.6 on the Moonshot API is OpenAI compatible. Swap base URLs, keep your existing tool calling code. Structured outputs and thinking mode toggle through the usual flags.
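In practice the swap is a few lines of client configuration. The sketch below builds one; the base URL path and the `thinking` toggle are assumptions inferred from the article's "OpenAI compatible" claim and "usual flags" phrasing, not verified against Moonshot's docs.

```python
# Pointing an existing OpenAI-style client at the Moonshot endpoint.
# Base URL and the "thinking" flag name are assumptions for illustration.
def client_config(api_key: str, thinking: bool = True) -> dict:
    return {
        "base_url": "https://api.moonshot.ai/v1",  # assumed endpoint
        "model": "kimi-k2.6",
        "default_headers": {"Authorization": f"Bearer {api_key}"},
        "extra_body": {"thinking": thinking},  # hypothetical mode toggle
    }

cfg = client_config("sk-example")
print(cfg["base_url"], cfg["model"])
```

Existing tool-calling code should carry over unchanged; only the base URL, key, and model name differ from an OpenAI setup.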

Where it still trails

Three honest caveats before anyone rips out their Claude integration on a Tuesday afternoon.

  • Verbosity. Thinking mode burns tokens. A full reasoning trace can exceed 90K tokens. Output pricing matters.
  • MMMU-Pro. GPT-5.4 and Claude still edge it on heavy multimodal reasoning. MoonViT is strong but not yet best in class.
  • Tooling maturity. The consumer Kimi app, Claw CLI, and Agent Swarm are slick, but integrations into mainstream agent frameworks like LangGraph or Temporal are still catching up. Expect a messy two weeks of community PRs.
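The verbosity caveat is easy to quantify with the Workers AI output price quoted earlier (the traces-per-day figure is an illustrative assumption):

```python
# Cost of one 90K-token reasoning trace at $4 per million output tokens.
trace_tokens = 90_000
out_price_per_million = 4.00
cost = trace_tokens / 1e6 * out_price_per_million
print(f"${cost:.2f} per trace")  # $0.36
# At an assumed 1,000 traces/day, reasoning tokens alone run $360/day.
```

Cheap per query, but it compounds fast in swarm setups where hundreds of sub-agents each carry their own trace.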

What this means for the rest of 2026

Three things to watch.

  1. Anthropic's walled Mythos. If Claude Mythos stays behind a 50-company firewall while open models keep landing at frontier capability, the enterprise procurement story gets very uncomfortable very fast.
  2. DeepSeek and Qwen responses. Both labs have shipped on a quarterly rhythm. Expect a V4 or Qwen3.5 inside 8 weeks that narrows K2.6's lead on at least one axis.
  3. The real agent eval. Benchmarks are saturating. The next credible measurement is dollars of work delivered per dollar of compute per hour of autonomy. K2.6 is the first open model that can actually be measured on that axis.

The bottom line

Kimi K2.6 is not a surprise. It is the logical continuation of a curve that has been visible since DeepSeek R1 in January 2025. The thing that makes it notable is that Moonshot stopped optimizing for the benchmark of the moment and started optimizing for the workflow people actually want: a model you hand a goal to and come back to later.

If you are building agents in 2026, you need an answer for why you are not using this. The burden of proof has flipped.

See where K2.6 sits against every other production model on live price, speed, and quality data in the What LLM comparison tool, or jump to the open source rankings.

Cite this article

Bristot, D. (2026, April 21). Kimi K2.6 is here: the open model that refuses to clock out. What LLM. https://whatllm.org/blog/kimi-k2-6

Sources: Moonshot AI (Hugging Face model card) · kimi.com release notes · Cloudflare Workers AI pricing · Artificial Analysis.