Kimi K2 Thinking proves the open weights gap is closing
TL;DR: The open weights inflection point
- Kimi K2 Thinking scores 67 on the Artificial Analysis Intelligence Index, second overall and first among open weights, while leading GPT-5 on tool-assisted HLE runs and BrowseComp.
- Moonshot AI delivered a 1 trillion parameter Mixture of Experts model with 32 billion active parameters, a 256K context window, and native INT4 serving for roughly $4.6M in training cost.
- The Intelligence Index evaluation consumed 140 million output tokens yet base endpoint pricing keeps the total bill near $356, well below comparable proprietary runs.
- Chinese labs now dominate open source velocity: Moonshot, DeepSeek, Qwen, and 01.AI shipped more public large models in 2025 than Western peers, often at 10x to 100x lower inference prices.
- Community sentiment on X and enterprise adoption (Airbnb, Chamath Palihapitiya’s funds) signal that open weights releases are already displacing closed models in production.
The launch of Moonshot AI’s Kimi K2 Thinking in early November 2025 crystallized a shift that had been building all year. Open weights releases from Chinese labs are not chasing proprietary incumbents anymore. On reasoning, coding, and agentic tasks they are equal or ahead, and the cost curve now tilts decisively toward public checkpoints.
The tipping point: K2 Thinking in focus
K2 Thinking is a 1 trillion parameter Mixture of Experts model with 384 experts and 32 billion active parameters per token. Moonshot trained it for roughly $4.6 million, then released it under a modified MIT license with weight files hosted on Hugging Face. The checkpoint ships with a 256K context window and native INT4 quantization, so inference runs twice as fast as the FP8 instruct siblings without measurable quality loss.
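The sparse-activation math behind those numbers can be illustrated with a toy top-k router. This is a generic Mixture of Experts gating sketch, not Moonshot's actual routing code; the top-8 selection count is an illustrative assumption, used here only to show how a 1 trillion parameter checkpoint can fire a small fraction of its experts per token:

```python
import math
import random

def topk_gate(scores, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

random.seed(0)
num_experts = 384
router_scores = [random.gauss(0, 1) for _ in range(num_experts)]
weights = topk_gate(router_scores, k=8)

# Only a handful of the 384 experts fire per token, which is how a
# 1T-parameter checkpoint can activate roughly 32B parameters per pass.
print(len(weights), round(sum(weights.values()), 6))
```

The gate weights sum to one, so the selected experts' outputs can be combined as a convex mixture; everything outside the top-k contributes nothing and costs no compute.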
The model was built as a thinking agent. In Moonshot’s internal runs it autonomously selects and chains two hundred to three hundred tool calls, interleaving step-by-step reasoning, search, browsing, and code execution. That is the workflow that proprietary systems have used to justify premium pricing. K2 Thinking now offers it in the open.
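The chained tool-call workflow described above boils down to a reason-act loop. The sketch below is a minimal stand-in: `call_model`, the tool registry, and the fixed two-step trace are all hypothetical, whereas a real deployment against Moonshot's API would parse structured tool calls from streamed model output and run up to the two- to three-hundred-step budget:

```python
# Minimal reason-act loop in the spirit of K2 Thinking's agentic chaining.

def search(query):        # hypothetical tool
    return f"results for {query!r}"

def run_code(snippet):    # hypothetical tool
    return f"executed {snippet!r}"

TOOLS = {"search": search, "run_code": run_code}

def call_model(history):
    # Stub: a real model would decide the next action from the transcript.
    step = len([m for m in history if m[0] == "tool"])
    if step == 0:
        return ("tool", "search", "INT4 quantization")
    if step == 1:
        return ("tool", "run_code", "print(2 + 2)")
    return ("final", "Answer assembled from 2 tool calls.")

def agent(task, max_steps=300):
    history = [("user", task)]
    for _ in range(max_steps):
        action = call_model(history)
        if action[0] == "final":
            return action[1]
        _, name, arg = action
        history.append(("tool", name, TOOLS[name](arg)))
    return "step budget exhausted"

print(agent("summarize INT4 serving"))
```

The `max_steps` cap matters in practice: a 300-call budget is also a 300-call bill, which is where the verbosity trade-off discussed below comes in.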
Benchmarks that reset expectations
Public benchmark numbers released by Moonshot AI, Artificial Analysis, VentureBeat, and BD TechTalks put K2 Thinking ahead or level with flagship proprietary systems:
- Humanity’s Last Exam (tools): 44.9 percent, edging GPT-5 at 43.2 percent and Claude Sonnet 4.5 at 42.8 percent (VentureBeat).
- BrowseComp: 60.2 percent, the highest score on record, beating GPT-5 and Sonnet 4.5 on web navigation and factual synthesis (BD TechTalks).
- SWE-Bench Verified: 71.3 percent, leading every open weights model and rivaling closed offerings like Claude Opus 4 for software bug fixing (StartupHub.ai).
- GPQA Diamond: 85.7 percent vs GPT-5 at 84.5 percent on graduate-level science questions (VentureBeat).
- AIME and HMMT 2025: Matches GPT-5 base configurations on competition math, continuing the DeepSeek-R1 trend from earlier in the year (VentureBeat).
Verbosity is the clear trade-off. Artificial Analysis logged 140 million output tokens across the Intelligence Index suite, about two and a half times the token volume of DeepSeek V3.2 Exp and roughly double GPT-5. Even so, the base endpoint, priced at $0.60 per million input tokens and $2.50 per million output tokens, kept the full run at $356. The turbo endpoint, priced at $1.15 and $8.00, lands at $1,172, second only to Grok 4.
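The arithmetic is easy to check. Output tokens alone account for $350 of the roughly $356 base bill and $1,120 of the $1,172 turbo bill; the remainder is input tokens, whose volume Artificial Analysis has not published, so it is left out of this sketch:

```python
def run_cost(input_mtok, output_mtok, in_price, out_price):
    """Total bill in dollars, given token volumes in millions of tokens."""
    return input_mtok * in_price + output_mtok * out_price

output_mtok = 140  # tokens emitted across the Intelligence Index suite

base_output_cost = run_cost(0, output_mtok, 0.60, 2.50)
turbo_output_cost = run_cost(0, output_mtok, 1.15, 8.00)

print(base_output_cost)   # 350.0 dollars before input tokens
print(turbo_output_cost)  # 1120.0 dollars before input tokens
```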
Interactive benchmarks from Artificial Analysis
The live dashboards below come directly from ArtificialAnalysis.ai, whose Intelligence, Coding, and Agentic indices power our internal comparisons. Explore the embeds to see where K2 Thinking leads and where proprietary peers still hold an edge.
Serving efficiency
The native INT4 weights come to about 594 GB, so pre-Blackwell NVIDIA hardware can host the full model. Community inference stacks are already targeting 8 x H100 or 16 x A100 clusters for sustained throughput.
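A back-of-envelope check shows why those cluster sizes work. One trillion parameters at 4 bits each is a raw floor of 500 GB; the published 594 GB presumably also covers tensors kept at higher precision plus metadata, which explains the gap. The GPU count below simply divides the checkpoint by an 80 GB H100:

```python
def int4_weight_gb(params_billion, bits=4):
    """Approximate weight size in GB: params * bits / 8 bytes each."""
    return params_billion * 1e9 * bits / 8 / 1e9

raw_floor_gb = int4_weight_gb(1000)   # 1T params at 4 bits -> 500.0 GB
checkpoint_gb = 594                   # published INT4 checkpoint size
gpus_needed = -(-checkpoint_gb // 80) # ceil(594 / 80 GB per H100) -> 8

print(raw_floor_gb, gpus_needed)
```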
Latency profile
Base endpoint output runs at around eight tokens per second; the turbo endpoint raises that to roughly fifty. Grok 4 still leads at ninety-plus tokens per second, but K2 Thinking now outpaces many proprietary reasoning tiers.
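Those throughput numbers translate directly into wall-clock latency. The 2,000-token response length below is an illustrative assumption, not a published figure, chosen as a plausible long reasoning trace:

```python
def seconds_for(tokens, tok_per_sec):
    """Wall-clock time to stream a response at a given decode rate."""
    return tokens / tok_per_sec

response_tokens = 2000  # illustrative long reasoning trace

print(seconds_for(response_tokens, 8))   # base endpoint: 250.0 s
print(seconds_for(response_tokens, 50))  # turbo endpoint: 40.0 s
print(seconds_for(response_tokens, 90))  # Grok 4 pace: ~22.2 s
```

At base speeds a deep agentic run is a minutes-scale operation, which is why the turbo tier exists despite its higher per-token price.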
Chinese labs vs Western incumbents: 2025 scorecard
A cross-lab snapshot underscores how quickly Chinese open source has matured. Training cost numbers come from VentureBeat, Asia Times, and StartupHub.ai coverage.
| Model | Type | Key strengths | Training cost | HLE score | SWE-Bench | Notable edge |
|---|---|---|---|---|---|---|
| Kimi K2 Thinking (Moonshot) | Open MoE | Agentic chaining, 384 experts, INT4 serving | Approx. $4.6M | 44.9% | 71.3% | Beats GPT-5 on tool-assisted HLE and BrowseComp |
| DeepSeek-R1 (DeepSeek) | Open | Hierarchical experts, math and coding efficiency | Approx. $5.6M (V3 base) | 42.1% | 68.4% | Rivals OpenAI o1 on math at a fraction of the compute |
| Qwen-2.5 235B (Alibaba) | Open | Sparse gating, long context, enterprise support | Low (undisclosed) | 41.5% | 67.2% | Adopted by Airbnb for customer automation |
| GPT-5 (OpenAI) | Closed | Broad capability, heavy mode aggregation | Billions | 43.2% | 70.1% | Highest overall Intelligence Index score to date |
| Claude Sonnet 4.5 (Anthropic) | Closed | Safety aligned reasoning, creative writing | Billions | 42.8% | 69.8% | Strong safety guardrails with competitive quality |
Analysts such as Sebastian Raschka note that Kimi’s 384-expert mixture makes chained reasoning efficient where Western models rely on brute compute. Nathan Lambert tallied more public model releases from Chinese labs than from any other region by mid-2025, and Nvidia’s Jensen Huang described China as being only “nanoseconds” behind the United States in AI progress.
Why Chinese labs are driving the open wave
- Release cadence: Moonshot shipped K2 instruct models in July and September, then followed with K2 Thinking in November. DeepSeek and Qwen operate on similar quarterly rhythms while Western peers delayed open releases for safety reviews.
- Efficiency under constraint: Export controls on advanced GPUs forced innovation. DeepSeek’s tree-structured experts and Qwen’s sparse gating keep idle compute low. INT4 post-training lets K2 Thinking run on widely available hardware.
- Economics: K2 Thinking’s API costs $0.60 per million input tokens and $2.50 per million output tokens. Claude 4 Opus can cost more than one hundred times that on heavy workloads.
- Adoption momentum: Airbnb selected Qwen for customer support because it is “fast, good, cheap.” Chamath Palihapitiya’s firms migrated agentic workloads to K2 Thinking for better performance at lower spend (Asia Times).
- Ecosystem breadth: Chinese labs are also shipping competitive video generators (Kuaishou, MiniMax) and multimodal reasoning models (Kimi K1.5 matching OpenAI o1 on math), creating full stack gravity.
What this means for builders
- Agent platforms: Plug K2 Thinking into research, compliance, or customer service agents. Its 93 percent τ²-Bench Telecom score shows it can execute deep playbooks without human supervision.
- Coding automation: SWE-Bench Verified at 71.3 percent makes it the top open weights coding model. Pair it with retrieval or diff review pipelines to keep verbosity in check.
- Hybrid routing: Use smaller routers (Gemma 3, Phi-4) for easy tasks and escalate to K2 Thinking when deep reasoning is required. This tames cost while capturing the quality gains.
- Infrastructure vendors: INT4 weights create opportunity for Fireworks AI, Baseten, Novita Labs, Parasail, and others to launch specialized endpoints before FP4 hardware is mainstream.
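The hybrid-routing pattern above can be sketched in a few lines. The heuristic and the model names are illustrative placeholders, not a production policy; real routers typically score prompts with a small classifier rather than keyword matching:

```python
# Cost-aware routing sketch: cheap model for easy prompts, K2 Thinking
# for anything that smells like deep reasoning. Names are placeholders.

CHEAP_MODEL = "small-router-model"
HEAVY_MODEL = "kimi-k2-thinking"

def needs_deep_reasoning(prompt):
    """Toy heuristic: long prompts or agentic keywords go to the big model."""
    keywords = ("prove", "debug", "plan", "multi-step")
    return len(prompt) > 500 or any(k in prompt.lower() for k in keywords)

def route(prompt):
    return HEAVY_MODEL if needs_deep_reasoning(prompt) else CHEAP_MODEL

print(route("What is the capital of France?"))                   # cheap model
print(route("Debug this failing integration test end to end"))   # K2 Thinking
```

The payoff is the cost asymmetry described earlier: every prompt kept on the small model avoids a verbose, multi-hundred-token reasoning trace at the heavy tier's output price.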
Voices from the community
- Elliot Arledge: “China won. Kimi K2 Thinking is the moment open source passed closed source on agentic tasks.”
- Chubby (Kimmonismus): “Sanctions backfired. China became more creative and might now win the AGI race because open source this year came from Beijing.”
- Victoria Slocum: “Everyone assumed cutting edge AI would always come from Silicon Valley. Kimi K2 Thinking just shattered that belief.”
- David Ondrej: “Models like Kimi, DeepSeek, and Qwen crush benchmarks at five to fifty times cheaper inference. Closed labs risk losing billions.”
- Deedy: “Kimi K2 is having a mini DeepSeek moment on OpenRouter, climbing past Grok 4 and GPT-4.1 even on creative tasks.”
- Ahmad Osman: “China saved open source LLMs. Qwen3 Next, GLM 4.5, DeepSeek V3.1, Kimi K2. The United States and Europe need to respond.”
What to watch next
- Moonshot’s agentic mode SDK: The consumer chat interface does not expose the full tool stack yet. A developer release will show how configurable verbosity and tool routing can become.
- Community fine-tunes: Expect derivatives that emphasize summary-first prompting, retrieval augmented memory, and reinforcement rewards that penalize unnecessary tokens.
- Rival responses: DeepSeek, Qwen, and 01.AI are already teasing new releases. Efficiency-focused upgrades are likely on a quarterly cadence.
- Enterprise procurement shifts: Watch for more RFPs that require open weights options alongside proprietary APIs. Cost pressure and customization needs make hybrid stacks the default.
The bottom line
Kimi K2 Thinking is not an outlier. It confirms that the open versus closed quality gap is now single digits while economics favor the open camp. Chinese labs seized the initiative, and the open ecosystem is accelerating because of it.
Plan for parity. By 2026, assume open weights models can handle the majority of workloads you once reserved for proprietary APIs. The strategic questions now revolve around deployment discipline, prompt compression, and hybrid routing rather than access to capability.
Want a live pulse on how K2 Thinking stacks up against every other production model? Explore the What LLM comparison tool for real-time pricing, speed, and quality stats.
📚 Cite this article
If this analysis informs your work, please use the citation below:
Bristot, D. (2025, November 10). Kimi K2 Thinking: How open weights are catching proprietary AI. What LLM. https://whatllm.org/blog/kimi-k2-thinking-open-weights
Data sources: Moonshot AI (Kimi K2 Thinking technical report, November 2025) · Artificial Analysis (Intelligence Index, Agentic Index, Coding Index, November 2025) · VentureBeat · BD TechTalks · StartupHub.ai · Asia Times · CNBC · The Rundown AI