Kimi K2.6 is here: the open model that refuses to clock out
TL;DR
- Moonshot AI shipped Kimi K2.6 on April 20, a 1T parameter MoE with 32B active, 262K context, and native vision through MoonViT.
- It is built to run 12+ hour sessions with 4,000+ tool calls and to coordinate swarms of up to 300 sub-agents. This is not a better chatbot. It is an engineer that does not log off.
- Benchmarks land at or above GPT-5.4 and Claude Opus 4.6 on HLE-Full with tools (54.0), BrowseComp (83.2), GPQA-Diamond (90.5), and AIME 2026 (96.4); Claude keeps a narrow lead on SWE-Bench Pro (60.1 vs 58.6).
- Cloudflare Workers AI lists it at $0.95 per million input, $4 per million output. Claude Opus 4.6 is roughly 15x that on heavy workloads.
- Open weights on Hugging Face under a modified MIT license. vLLM and SGLang work out of the box. No waitlist, no gated firewall, no 50-company preview.
Every few months somebody in the open source camp ships a model that forces a rewrite of the ranking. Kimi K2.6 is that model for April 2026. It is the first open weights release where the interesting benchmark is not how smart but how long it stays on task, and the answer is measured in hours.
What Moonshot actually shipped
The architecture is the same skeleton as K2 Thinking with a heavier pass of reinforcement learning and a redesigned vision tower. The spec sheet reads like last year's frontier:
- 1 trillion total parameters, 32 billion active per token. Mixture of Experts with 384 experts, 8 routed plus 1 shared.
- 61 layers, 64 MLA attention heads, 7168 hidden dim, SwiGLU, 160K vocabulary.
- 262,144-token context window. Some modes allow up to 262K output tokens, with reasoning budgets that can chew through 98K tokens before the final answer.
- MoonViT vision encoder at 400M parameters, native in the model rather than bolted on as an adapter.
- Modified MIT license. Weights on Hugging Face as moonshotai/Kimi-K2.6, attribution required on large deployments.
Structured outputs, JSON schema, parallel tool calls, and an interleaved thinking mode where the model can pause mid-chain to call a tool and resume reasoning are all native. No stitching required.
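As a sketch of what that native tool calling looks like from the caller's side, here is an OpenAI-style request body. Field names follow the OpenAI chat-completions convention the article says the API mirrors; the `run_tests` tool is a hypothetical example, not part of the release.

```python
import json

# OpenAI-style chat-completions request for K2.6's native tool calling.
# The tool definition is a hypothetical illustration.
request = {
    "model": "kimi-k2.6",
    "messages": [
        {"role": "user", "content": "Find the failing tests and propose fixes."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_tests",  # hypothetical tool for illustration
                "description": "Run the test suite and return failures as JSON.",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }
    ],
    "parallel_tool_calls": True,  # multiple tool calls per turn
}

print(json.dumps(request)[:60])
```

With interleaved thinking, the model can emit a tool call mid-reasoning, receive the result as a tool message, and continue the same chain rather than starting a fresh turn.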
The real story is duration, not intelligence
Every major lab now posts a reasoning score in the high 80s on GPQA. The interesting frontier in 2026 is not another point on a multiple choice exam. It is whether the model can still be useful at hour 9 of a coding session when the context is full of tool output, half-failed tests, and a plan it wrote to itself at hour 2.
Moonshot is leaning into that. K2.6 is marketed around 4,000 tool calls over a 12 hour run, Claw Groups for human plus agent teams, and Agent Swarms that spin up 300 sub-agents on a shared objective. The demo that is doing the rounds is a rewrite of an 8 year old financial matching engine: over 1,000 edits, a full test harness, and throughput gains the team had written off as physically impossible.
You can argue about the demo. You cannot argue about the benchmark category that used to be a proprietary moat. Long-horizon autonomy is now an open weights capability.
Benchmarks that matter
All numbers from Moonshot's release card, verified against Artificial Analysis runs where available. Thinking mode on, temperature 1.0, top-p 1.0.
| Benchmark | Kimi K2.6 | GPT-5.4 | Claude Opus 4.6 | Note |
|---|---|---|---|---|
| HLE-Full (with tools) | 54.0 | 52.1 | 53.0 | Open weights SOTA |
| BrowseComp | 83.2 (swarm 86.3) | 81.4 | 80.7 | Web research and synthesis |
| SWE-Bench Pro | 58.6 | 57.9 | 60.1 | Hardest real world repo benchmark |
| SWE-Bench Verified | 80.2 | 79.5 | 82.4 | Narrowed gap with Claude |
| Terminal-Bench 2.0 | 66.7 | 61.3 | 64.9 | Long-horizon shell workflows |
| LiveCodeBench v6 | 89.6 | 88.1 | 87.3 | Competitive coding |
| GPQA-Diamond | 90.5 | 89.8 | 90.2 | Graduate science reasoning |
| AIME 2026 | 96.4 | 95.1 | 94.7 | Competition math |
| MathVision (Python) | 93.2 | 92.0 | 91.4 | Vision plus reasoning |
| MMMU-Pro | 79.4 | 81.2 | 80.1 | One of the few losses |
On Artificial Analysis, K2.6 lands at #4 overall on the Intelligence Index, behind GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6, and #1 among open weights. On the Agentic Index it is either first or tied for first depending on how you weight BrowseComp against τ²-Bench.
Pricing that actually breaks the proprietary pitch
The Moonshot API lists K2.6 as kimi-k2.6. Cloudflare Workers AI exposes it at @cf/moonshotai/kimi-k2.6 for $0.95 per million input tokens and $4 per million output. Fireworks, Baseten, and Ollama are already live. A full Humanity's Last Exam sweep on the base endpoint comes in somewhere around $450 including reasoning overhead.
For comparison, running the same suite through Claude Opus 4.6 at $15 in and $75 out lands north of $6,000. GPT-5.4 sits in the middle. For most agentic workloads, the quality gap no longer justifies that spend. That quiet shift started with Kimi K2 Thinking in November and is now obvious.
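The arithmetic is simple enough to sanity-check yourself. The 10M-in / 2M-out workload below is an illustrative assumption; the exact ratio depends on your input/output mix, since the output price gap is wider than the input gap.

```python
# Back-of-the-envelope cost comparison using the per-million-token
# list prices quoted above. Workload mix is an illustrative assumption.
def run_cost(m_in, m_out, price_in, price_out):
    """Dollar cost for m_in/m_out million tokens at per-million prices."""
    return m_in * price_in + m_out * price_out

k26 = run_cost(10, 2, 0.95, 4.0)    # Cloudflare Workers AI listing
opus = run_cost(10, 2, 15.0, 75.0)  # Claude Opus 4.6 list pricing

print(f"K2.6: ${k26:.2f}  Opus 4.6: ${opus:.2f}  ratio: {opus / k26:.1f}x")
```

For this mix the gap works out to roughly 17x; output-heavy agentic runs with long reasoning traces shift the ratio further, which is why thinking-mode verbosity (see the caveats below) matters commercially, not just aesthetically.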
Run it yourself
Weights are about 594 GB at native precision. 8x H100 or 16x A100 will serve it at production throughput on vLLM. If you do not have that sitting around, Fireworks and Baseten will spin up a dedicated endpoint in minutes.
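A rough memory sketch shows why 8x H100 is the floor rather than a comfortable fit. This ignores activation memory and runtime overhead, so treat it as illustration, not a deployment guide.

```python
# Per-GPU weight footprint for the 594 GB checkpoint sharded across
# an 8x H100 (80 GB HBM) node with tensor parallelism. Illustrative only:
# activation memory and runtime overhead are ignored.
WEIGHTS_GB = 594
GPUS = 8
HBM_PER_GPU_GB = 80

per_gpu = WEIGHTS_GB / GPUS          # weights per GPU
headroom = HBM_PER_GPU_GB - per_gpu  # what's left for KV cache etc.

print(f"{per_gpu:.2f} GB weights/GPU, {headroom:.2f} GB headroom")
```

At ~74 GB of weights per H100 there are only a few gigabytes left for KV cache, which is why long-context serving at the full 262K window pushes people toward 16x A100 or a dedicated hosted endpoint.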
Or call the API
kimi-k2.6 on the Moonshot API is OpenAI compatible. Swap base URLs, keep your existing tool calling code. Structured outputs and thinking mode toggle through the usual flags.
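The base-URL swap can be sketched with nothing but the standard library. The base URL and auth header below are assumptions for illustration; check the provider docs and keep your real key in an environment variable.

```python
import json
import urllib.request

# Sketch of the "swap the base URL, keep your payload" migration.
# BASE_URL and the bearer-token header are assumed for illustration.
BASE_URL = "https://api.moonshot.ai/v1"

payload = {
    "model": "kimi-k2.6",
    "messages": [{"role": "user", "content": "ping"}],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer $MOONSHOT_API_KEY",  # placeholder
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would actually send it; omitted here.
print(req.full_url)
```

Because the request shape is unchanged, existing OpenAI-client code (tool definitions, structured outputs, streaming) should carry over with only the endpoint and key swapped.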
Where it still trails
Three honest caveats before anyone rips out their Claude integration on a Tuesday afternoon.
- Verbosity. Thinking mode burns tokens. A full reasoning trace can exceed 90K tokens. Output pricing matters.
- MMMU-Pro. GPT-5.4 and Claude still edge it on heavy multimodal reasoning. MoonViT is strong but not yet best in class.
- Tooling maturity. The consumer Kimi app, Claw CLI, and Agent Swarm are slick, but integrations into mainstream agent frameworks like LangGraph or Temporal are still catching up. Expect a messy two weeks of community PRs.
What this means for the rest of 2026
Three things to watch.
- Anthropic's walled Mythos. If Claude Mythos stays behind a 50-company firewall while open models keep landing at frontier capability, the enterprise procurement story gets very uncomfortable very fast.
- DeepSeek and Qwen responses. Both labs have shipped on a quarterly rhythm. Expect a V4 or Qwen3.5 inside 8 weeks that narrows K2.6's lead on at least one axis.
- The real agent eval. Benchmarks are saturating. The next credible measurement is dollars of work delivered per dollar of compute per hour of autonomy. K2.6 is the first open model that can actually be measured on that axis.
The bottom line
Kimi K2.6 is not a surprise. It is the logical continuation of a curve that has been visible since DeepSeek R1 in January 2025. The thing that makes it notable is that Moonshot stopped optimizing for the benchmark of the moment and started optimizing for the workflow people actually want: a model you hand a goal to and come back to later.
If you are building agents in 2026, you need to have an answer for why you are not using this. The burden of proof flipped.
See where K2.6 sits against every other production model on live price, speed, and quality data in the What LLM comparison tool, or jump to the open source rankings.
Cite this article
Bristot, D. (2026, April 21). Kimi K2.6 is here: the open model that refuses to clock out. What LLM. https://whatllm.org/blog/kimi-k2-6
Sources: Moonshot AI (Hugging Face model card) · kimi.com release notes · Cloudflare Workers AI pricing · Artificial Analysis.