DeepSeek V4 is here: the open model that made Jensen Huang's “horrible outcome” real
A 1.6 trillion parameter mixture-of-experts with native 1M context, MIT-licensed weights, and three built-in reasoning modes. Trained on Huawei silicon. Priced so low it rewrites the enterprise math. Eight days ago, Nvidia's CEO called this exact scenario a horrible outcome. Today it shipped on Hugging Face.
V4 at a glance
TL;DR
- DeepSeek shipped V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active) on April 24, 2026. MIT weights on Hugging Face. API, web, and app live the same day.
- Native 1 million token context is the new default. A new hybrid attention scheme called DSA (DeepSeek Sparse Attention) cuts inference compute to roughly 27% and KV cache to 10% of V3.2 at the same length.
- V4-Pro rivals frontier closed models on LiveCodeBench 93.5, GPQA Diamond 90.1, SWE-Verified ~80, Codeforces 3206, while staying open-weight.
- Priced at approximately $1.74/$3.48 per million tokens for Pro and $0.14/$0.28 for Flash. That is roughly 1/20th the cost of Claude Opus 4.7 on output-heavy workloads.
- Heavily optimized for Huawei Ascend silicon. No Nvidia in the training loop. Jensen Huang spent an hour with Dwarkesh Patel eight days ago warning that this specific scenario would be “a horrible outcome.” Here we are.
Every so often an open model release makes the whole board reshuffle. DeepSeek V4 is one of those. Not because the benchmarks are unbeatable (they are not), but because the cost curve, the context window, the license, and the hardware story arrive in the same package, on the same day, from the same lab that shocked the industry 15 months ago. The first “DeepSeek moment” moved markets. This one moves a geopolitical thesis.
The prophecy, and the receipt
On April 16, Dwarkesh Patel published an 80-minute conversation with Nvidia CEO Jensen Huang titled “TPU competition, why we should sell chips to China, & Nvidia’s supply chain moat.” Jensen was animated in a way he rarely is on the record. He argued against the US export-control posture, called it a “loser’s mindset,” and reached for a worst-case to make his point concrete.
“The day that DeepSeek comes out on Huawei first, that is a horrible outcome for our nation.”
Eight days later, at around 7am UTC on April 24, the official @deepseek_ai account posted the V4 announcement. Two variants. Full open weights. Tech report. Live API. 1M context. Pricing that undercuts every frontier lab by an order of magnitude. And (this is the part Jensen was actually worried about) heavily optimized for Huawei Ascend accelerators with the CANN software stack underneath, not CUDA.
That is not a hypothetical anymore. That is a model card. The thing he was warning the United States about, 192 hours before it shipped, is currently the top trending repo on Hugging Face.
What DeepSeek actually shipped
Two models, same family, same architecture, wildly different sizes. Both released as preview checkpoints with weights under a permissive license, full tech report, OpenAI-compatible and Anthropic-compatible API endpoints on day one, and web/app access through Expert Mode (thinking on) and Instant Mode (thinking off).
V4-Pro
Flagship · April 24 · MIT · 1.6T MoE
DeepSeek’s own marketing describes it as “rivaling the world’s top closed-source models”. That claim holds on coding and reasoning; it trails Gemini 3.1 Pro on world knowledge. Open-source SOTA on agentic coding benchmarks.
V4-Flash
Efficient · April 24 · MIT · 284B MoE
Reasoning capabilities “closely approach V4-Pro” per DeepSeek’s announcement. On par with Pro on simple agent tasks. Designed to be the default production choice when Pro is overkill, which is often.
Both ship with three reasoning modes you can actually pick from: Non-think for fast daily tasks, Think High for hard analysis, and Think Max when you want the model to really chew on a problem. The legacy deepseek-chat and deepseek-reasoner endpoints now quietly route to V4-Flash variants. The older names stop working on July 24, 2026.
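If you are calling the legacy names today, the routing change can be captured in a small migration shim. A minimal sketch, assuming hypothetical identifiers: the announcement only says the old names route to "V4-Flash variants," so the exact new model strings and mode names (`non-think`, `think-high`, `think-max`) below are illustrative, not confirmed API values.

```python
# Hypothetical migration map for the July 24, 2026 endpoint retirement.
# Model identifiers and mode names are assumptions for illustration;
# check DeepSeek's API change log for the real strings before cutover.
LEGACY_TO_V4 = {
    "deepseek-chat": ("deepseek-v4-flash", "non-think"),
    "deepseek-reasoner": ("deepseek-v4-flash", "think-high"),
}

def migrate(model, reasoning=None):
    """Map a legacy model name to a (model, reasoning_mode) pair."""
    if model in LEGACY_TO_V4:
        new_model, default_mode = LEGACY_TO_V4[model]
        return new_model, reasoning or default_mode
    # Already on a V4 name: pass through, defaulting to the fast mode.
    return model, reasoning or "non-think"
```

The point of centralizing the mapping is that the July cutover becomes a one-line change instead of a grep across your codebase.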
The architecture story: how 1M context finally became affordable
“1 million token context window” has been the marketing claim for almost two years. The honest version has always been: yes, but at what price, and at what latency? Most long-context numbers you have seen are either priced as a luxury tier or degrade badly past a few hundred thousand tokens. V4 is the first credible attempt at making 1M the standard rate.
The centerpiece is a new attention scheme DeepSeek calls DSA, or DeepSeek Sparse Attention. It combines token-wise compression with a learned sparse pattern so most of the attention computation skips the pieces of context that do not matter. Add a heavily compressed secondary attention path and you get numbers that break the usual long-context cost curve.
| At 1M context | DeepSeek V3.2 | DeepSeek V4 | Change |
|---|---|---|---|
| Single-token inference FLOPs | 100% (baseline) | ~27% | -73% |
| KV cache footprint | 100% (baseline) | ~10% | -90% |
Source: DeepSeek V4 tech report and Hugging Face model card, April 24, 2026. Measured at 1M token context length against DeepSeek V3.2 as baseline.
A 90% reduction in KV cache at the longest context setting is not a rounding error. It is the difference between 1M-context agents being a demo and being the default. If you have built anything that needs to hold a whole codebase, a long document corpus, or a multi-hour agent trace in working memory, this changes your unit economics.
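The magnitude of that KV saving is easy to sanity-check with back-of-envelope arithmetic. A sketch under stated assumptions: the layer count, KV head count, and head dimension below are illustrative placeholders, not V4's published hyperparameters, and the 10% factor is simply the tech report's headline number applied to a dense-attention baseline.

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # K and V tensors: per layer, per KV head, seq_len * head_dim
    # elements each, at bytes_per bytes per element (fp16/bf16 = 2).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1e9

# Hypothetical dense-attention config at 1M tokens (illustrative only).
full = kv_cache_gb(1_000_000, 61, 16, 128)  # ~500 GB per sequence
dsa = full * 0.10                           # ~50 GB at V4's reported ~10%
```

At those (assumed) shapes, a single 1M-token sequence drops from roughly half a terabyte of KV cache to ~50 GB, which is the difference between "one sequence barely fits a node" and "long-context batching is a normal serving problem."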
Two more architectural pieces worth naming:
- Manifold-Constrained Hyper-Connections (mHC). DeepSeek’s answer to training stability at this scale. Improves residual signal propagation through deep MoE stacks and keeps training losses well-behaved without aggressive clipping.
- Muon optimizer. Faster convergence during pre-training. It is the same Muon family making the rounds in smaller-scale research; DeepSeek appears to be the first lab to run it at trillion-parameter MoE scale in production.
Pre-training corpus: more than 32 trillion tokens of diverse, high-quality data. Post-training is a two-stage pipeline. First, domain-specific “expert cultivation” using SFT plus GRPO reinforcement learning, then an on-policy distillation pass that unifies those experts into a single checkpoint you can actually serve.
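For readers unfamiliar with GRPO (Group Relative Policy Optimization), its core trick is cheap to sketch: sample a group of completions per prompt, score them, and normalize each reward against the group's own mean and standard deviation instead of training a separate value model. This is the published GRPO recipe in general, not DeepSeek's V4 training code.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two good completions, two bad ones: advantages are symmetric
# around zero, so the group itself acts as the baseline.
advs = group_advantages([1.0, 1.0, 0.0, 0.0])
```

The appeal at trillion-parameter scale is exactly this simplicity: no critic network to train or serve alongside the policy.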
Benchmarks that matter
Numbers below are from the official DeepSeek V4 tech report and Hugging Face model cards. Base model comparisons first; these are the fair apples-to-apples numbers before any instruct tuning or thinking budget.
| Benchmark | Category | V3.2-Base | V4-Flash-Base | V4-Pro-Base |
|---|---|---|---|---|
| MMLU (5-shot) | Knowledge | 87.8 | 88.7 | 90.1 |
| MMLU-Pro (5-shot) | Knowledge | 65.5 | 68.3 | 73.5 |
| HumanEval (Pass@1) | Code | 62.8 | 69.5 | 76.8 |
| MATH (4-shot) | Math | 60.5 | 57.4 | 64.5 |
| LongBench-V2 (1-shot) | Long context | 40.2 | 44.7 | 51.5 |
Source: DeepSeek V4 tech report, April 24, 2026. Base model comparison.
Instruct-tuned V4-Pro, running in “Max” thinking mode, pushes into frontier closed-model territory. The numbers DeepSeek is publishing are not marketing: they come with temperature, top-p, and reasoning budgets spelled out in the tech report.
| Benchmark | V4-Pro (Max) | Notes |
|---|---|---|
| LiveCodeBench | 93.5 | Beats many closed frontier models |
| Codeforces rating | 3206 | Grandmaster-level competitive coding |
| SWE-Bench Verified | ~80.6 | Within a point of Opus 4.6 |
| GPQA Diamond | 90.1 | Graduate-level reasoning, competitive with top closed models |
| SimpleQA-Verified | 57.9 | World knowledge; trails Gemini 3.1 Pro |
| Terminal-Bench | SOTA* | Open-source SOTA on agentic shell workflows |
| MRCR 1M / CorpusQA 1M | Leader | Clear lead vs. most competitors at full context |
*Open-source SOTA as of April 24, 2026. Closed-model leaderboards will need independent runs before this is settled.
The pattern is clear: V4-Pro is a frontier-class coder and reasoner, a clear leader in long-context retrieval, and a respectable but not dominant world-knowledge model. If your job is to build agents that write code, operate terminals, and hold a codebase in their head, this is the best open-weight option ever shipped. If your job is “answer the most obscure trivia question anyone can think of,” pick Gemini.
The Uber math
The pricing is where this release stops being a technical story and becomes a procurement one. Community-confirmed launch pricing looks like this:
| Model | Input (/M tok) | Output (/M tok) | Output cost vs. Opus 4.7 |
|---|---|---|---|
| DeepSeek V4-Pro | $1.74 | $3.48 | ~1/22 |
| DeepSeek V4-Flash | $0.14 | $0.28 | ~1/270 |
| GPT-5.4 | $5.00 | $20.00 | ~1/4 |
| Claude Opus 4.7 | $15.00 | $75.00 | baseline (1×) |
DeepSeek pricing from community confirmations at launch (OpenRouter, early dev reports). Competitor pricing from published rate cards. Output-heavy comparison because output dominates agent workloads.
Here is the line that is doing numbers on X right now, posted within hours of the announcement:
“DeepSeek V4 is now the cheapest SOTA model available at 1/20th the cost of Opus 4.7. For perspective, if Uber used DeepSeek instead of Claude their 2026 AI budget would have lasted 7 years instead of only 4 months.”
The Uber story is real. The company rolled Claude Code out to roughly 5,000 engineers in December 2025, a widely publicized deployment that was supposed to prove the enterprise agentic-coding thesis. By mid-April 2026, four months later, they had burned through the entire 2026 AI budget. CTO Praveen Neppalli Naga called it “back to the drawing board” in an internal note that leaked to The Information. Another dev walked through the math in public: on a 60/40 output-heavy workload mix, a $116 day of Claude Opus 4.7 traffic rebills at roughly $16 on V4-Pro. That is 7.2× before anyone tunes a single prompt.
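The headline multiples check out against the rate card itself. A quick sketch: prices are $/M tokens from the table above, and the 40/60 input/output split is the "output-heavy" assumption from the dev post, not an official workload profile, so treat the exact ratios as illustrative.

```python
# Sanity check on the repricing claims using the launch rate card.
def blended_cost(input_price, output_price, input_frac=0.4):
    """Blended $/M tokens at a given input/output mix."""
    return input_frac * input_price + (1 - input_frac) * output_price

opus = blended_cost(15.00, 75.00)  # Claude Opus 4.7
pro = blended_cost(1.74, 3.48)     # DeepSeek V4-Pro
ratio = opus / pro                 # ~18x at this mix; output-only is ~22x
```

The exact multiple depends on your mix: a pure-output comparison gives the ~1/22 figure in the table, a 40/60 blend lands around 18×, and the dev-reported $116-to-$16 rebill (~7×) presumably reflects a heavier input share plus caching effects on that specific harness.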
This is the real reason the release matters. Not because V4-Pro is “smarter” on some leaderboard. Because the unit economics of production agents stopped making sense in Q1, and this is the first credible repricing. Either the closed labs come down, which dents their revenue story, or enterprises start self-hosting Chinese-origin weights, which dents the geopolitical story. Both happen, probably. Legal teams will hate it. CFOs will sign anyway.
The Huawei angle: why Jensen was right to be mad
There is one detail in the V4 release that has been underreported and that matters more than the benchmarks. DeepSeek did not just release a good open model. They released a good open model that was trained and served on Huawei Ascend, the Chinese alternative to Nvidia’s GPU line, with CANN as the alternative to CUDA.
Reports indicate the team spent months rewriting core training and inference code against the Ascend 910B and 950-series accelerators. Attention kernels, fused MoE routing, the long-context path, all ported. That is expensive work, and you only do it if you are committed to not depending on Nvidia.
The CUDA moat, rendered in specifics
The bet: if you gate the world’s most capable labs off Nvidia hardware, they cannot train frontier models. CUDA is the software moat; Blackwell/Hopper is the silicon moat. Both hold. China stays a year behind.
The reality: a frontier-class open model trained on Chinese silicon, served from a Chinese software stack, with MIT-licensed weights anyone can download. The CUDA moat holds for a lot of buyers. It no longer holds at the top of the market.
This is what Jensen was arguing on the podcast. Not that China would catch up in some abstract sense. He conceded that was already happening. His specific point was that export controls accelerate the decoupling. Every chip the US refuses to sell is a reason for Huawei to get better and for CANN to get more polished. And the tipping point is not when Huawei matches Nvidia on raw performance. It is when a frontier lab ships a real model on it. Because then the rest of the Chinese AI stack has a proof point, a reference implementation, and a migration guide.
DeepSeek just shipped that proof point. The podcast was April 16. The receipt was April 24. Nvidia stock opened the morning of the release at the same level as pre-podcast, but the medium-term thesis is now unambiguously harder to argue.
April 2026 in context: the open-source avalanche
If you zoom out, V4 is not a one-off. It is the capstone on the most concentrated month of open-weight releases in AI history.
- Four variants from 2B edge to 27B dense. Native text + image + audio. Google's strongest open-weight family yet.
- Meta Superintelligence Labs' first real frontier model. 52 on the Intelligence Index (vs. Llama 4 Maverick's 18). The $15B rebuild pays off.
- 744B MoE with 40B active. Reportedly tops GPT-5.4 on SWE-Bench Pro.
- Agentic coding focus, 1M context, extremely efficient on inference.
- Frontier closed model with new cybersecurity safeguards. The one V4 is most directly priced against.
- 1T MoE, 262K context, 4,000 tool calls per run. #1 open-weight before today.
- DeepSeek V4, today's release. 1.6T MoE, 1M context, Huawei-trained, priced to disrupt.
Seven major releases in 24 days, five of them open-weight, three from Chinese labs. The “open source is six months behind” line, which was still defensible in December, is now a rhetorical move rather than a factual claim. On specific axes like coding, agentic tasks, and long context, the open frontier is ahead.
Community reception: first twelve hours
The official announcement thread has crossed a million views and is still climbing. Early signals from developers actually using it this afternoon:
- 1M context feels usable, not theatrical. Several devs have posted traces of agents carrying 400K-800K tokens of codebase context through multi-step plans without the usual degradation at depth.
- Agentic coding feedback is excellent. DeepSeek confirmed the model is already their in-house agentic coding workhorse. Early third-party runs through Claude Code, OpenCode, and OpenClaw confirm it works as a drop-in replacement for the Anthropic endpoint with small adapter changes.
- Pricing shock is real. A dev running a nightly agentic test harness reported rebilling a Claude Opus day from ~$116 to ~$16 on V4-Pro with no prompt tuning. The same output-heavy mix on V4-Flash came in under $3.
- Legal is the bottleneck, not quality. Enterprise teams on X are asking the obvious questions about Chinese-origin weights, data residency, and procurement. Expect three weeks of compliance review before most regulated shops can actually move workloads.
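For teams wanting to try the drop-in swap themselves, Claude Code's documented environment overrides are the usual mechanism. A hedged sketch: the variable names are Claude Code's real configuration surface, but the DeepSeek base URL and model string shown here are assumptions based on the announcement's Anthropic-compatibility claim, not confirmed values.

```shell
# Hypothetical drop-in: point Claude Code at an Anthropic-compatible
# DeepSeek endpoint. Verify the actual base URL and model name against
# DeepSeek's API docs before relying on this.
export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="sk-your-deepseek-key"
export ANTHROPIC_MODEL="deepseek-v4-pro"
```

Because everything routes through environment variables, rolling back to the Anthropic endpoint is just unsetting three values, which makes A/B cost comparisons on a real workload cheap to run.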
No major regressions reported yet. Refusal rate is apparently lower than V3.2 on edge-case queries, which will be read two ways depending on your priors. Safety researchers are already running their usual probes.
Three caveats before you rip out your Claude integration on a Thursday
The honest version of any release-day take needs to include the caveats. V4 has three that matter.
- Preview, not final. Both models are labeled preview. DeepSeek has historically iterated post-release (V3 had several quiet improvements between checkpoints). Expect the “official” V4 in 6-10 weeks with sharper safety work and possibly tweaked pricing.
- World knowledge trails Gemini 3.1 Pro. If your workload is heavy factual recall (research, trivia, domain knowledge), V4-Pro is competitive but not leading. It is a better reasoner than encyclopedia.
- Self-hosting is serious hardware. V4-Pro weights are approximately 900GB at native precision. You need an 8-to-16-accelerator Ascend 950 or H100 setup to serve it at production throughput. V4-Flash fits comfortably on a single 8×H100 node. If you are building against the API, this is a non-issue; if you want true sovereignty, budget accordingly.
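The "8-16 accelerators" range above is worth decomposing. A back-of-envelope sketch under stated assumptions: the 900GB weight size is from the release notes, while the 80GB GPU memory figure and the 20% headroom reserved for KV cache and activations are illustrative choices, not vendor-published serving requirements.

```python
import math

def min_gpus(weights_gb, gpu_mem_gb=80, headroom=0.20):
    """Minimum accelerators to hold the weights, leaving headroom
    for KV cache, activations, and runtime overhead."""
    usable = gpu_mem_gb * (1 - headroom)
    return math.ceil(weights_gb / usable)

native = min_gpus(900)       # 15 -> in practice a 16-GPU pod
quantized = min_gpus(450)    # 8, if quantization halves the weights
```

That is where the 8-vs-16 split comes from: native precision forces a 16-accelerator pod, while halving the weights via quantization squeezes into a single 8-GPU node, at whatever quality cost your evals reveal.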
What this means for the rest of 2026
Four things to track coming out of today.
Four shifts, starting now
The price floor collapsed. Again.
V4-Flash at $0.14/$0.28 is the new reference price for “frontier-adjacent reasoning.” OpenAI, Anthropic, and Google will not match it. They do not have to. But they will have to defend why the premium is worth 10-50×, and the marketing answer (“it’s smarter”) is no longer enough when the gap is measured in single-digit percentage points on specific tasks.
CUDA is still the moat, except for the top.
The vast middle of the industry will keep running on Nvidia. But the top of the research frontier, the place that sets the narrative and the pricing, just proved it can operate without CUDA. That is a smaller market by revenue and a larger one by influence.
Agents were gated by cost, not capability.
Every conversation about “why aren’t more enterprises deploying agents” had the same subtext: the bill. The Uber story is the highest-profile case, but every team running Claude Code at scale knows it. When the per-token cost drops 10-20× and the context window hits 1M practically, the agent deployment curve steepens hard. Q3 is going to be noisy.
The walled-garden strategy got harder to defend.
Anthropic spent April locking Claude Mythos behind the Project Glasswing firewall: 50 companies, cybersecurity use cases, no public API. That was defensible when the alternative was “nothing comparable exists in the open.” It is harder to defend when GLM-5.1 ships MIT and beats you on SWE-Bench Pro, and DeepSeek V4 ships MIT and beats you on cost by 20×.
The bottom line
Pick your favorite narrative. DeepSeek V4 works for all of them.
If you are a builder: this is the model that finally lets you run the agent architecture you have been sketching on whiteboards since GPT-4. 1M context is practical. The price is low enough that cost is not the thing that kills the project. Go build.
If you are a CFO reviewing AI spend: print out the Uber story. Walk down the hall. Have the conversation. You are not being paranoid; the math is off by a factor of ten somewhere in your budget, and the replacement is MIT-licensed.
If you are Nvidia: you had a very good podcast eight days ago. You made the right argument. Nobody listened to Jensen in time. The “horrible outcome” is now a link on Hugging Face. Your CEO told us exactly what would happen, and it happened the same week. That is not a model-release story. That is a policy story, and you are going to spend Q2 lobbying against a wall that just fell.
If you are an open-source believer: this was a good week. The last “DeepSeek shock” in January 2025 was a revelation. This one is a pattern. Patterns are more powerful than revelations. The gap between open and closed keeps shrinking, and every time it shrinks, the deployment map redraws. We are not at the end of that curve. We are probably closer to the beginning than most closed labs want to admit.
One way to read April 2026: the month the industry stopped pretending the frontier lived inside the labs with the biggest GPU orders. DeepSeek just moved it into a MIT repo, trained on chips the US tried to keep out of China, priced like a rounding error. Eight days after one of the most powerful people in the industry warned on a podcast that this exact thing would be a catastrophe.
You do not need a stronger signal than that.
See where V4-Pro and V4-Flash land against Opus 4.7, GPT-5.4, Gemini 3.1 Pro, and the rest of the field in the What LLM comparison tool, or jump to the live open source rankings.
Frequently asked questions
What is DeepSeek V4?
DeepSeek V4 is a new family of open-weight mixture-of-experts models released on April 24, 2026 by DeepSeek AI. It ships in two variants: V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active). Both support native 1 million token context, three reasoning modes, and are MIT-licensed on Hugging Face. The API is both OpenAI- and Anthropic-compatible.
How is DeepSeek V4 pricing different from Claude or GPT?
Based on community-confirmed launch pricing, V4-Pro runs approximately $1.74 per million input tokens and $3.48 per million output; V4-Flash is roughly $0.14 input and $0.28 output. That puts V4-Pro at about 1/22× the output cost of Claude Opus 4.7, and V4-Flash at roughly 1/270×. MIT weights mean you can also self-host for free apart from hardware.
What is Jensen Huang's 'horrible outcome' and how does V4 relate?
In a Dwarkesh Patel podcast episode released around April 15-16, 2026, Nvidia CEO Jensen Huang argued against strict US AI export controls and specifically called out 'DeepSeek coming out on Huawei first' as 'a horrible outcome for our nation.' DeepSeek V4, released eight days later, is heavily optimized for Huawei Ascend accelerators and the CANN software stack instead of Nvidia's CUDA, making that specific scenario the current reality.
What is DeepSeek Sparse Attention (DSA)?
DSA is a new hybrid attention architecture in V4 that combines token-wise compression with a sparse pattern. At a 1M token context, V4-Pro uses roughly 27% of the single-token inference FLOPs and 10% of the KV cache that V3.2 would need for the same length. This is the change that makes 1M context economically viable in production, not just a benchmark talking point.
Is DeepSeek V4 really trained on Huawei chips?
DeepSeek has stated that V4 is heavily optimized for Huawei's Ascend 910B and 950-series accelerators, using the CANN software stack. The team reportedly spent months rewriting core training and inference kernels to run natively on Ascend silicon. This makes V4 the first frontier-class open model that is substantially decoupled from the Nvidia hardware ecosystem.
Can I self-host DeepSeek V4?
Yes. Both V4-Pro and V4-Flash weights are MIT-licensed on Hugging Face. V4-Flash is serviceable on a single 8×H100 node. V4-Pro is roughly 900GB at native precision and needs 8-16 H100s or equivalent Ascend hardware for production throughput. vLLM and SGLang support arrived shortly after launch.
When do the legacy DeepSeek endpoints retire?
The legacy deepseek-chat and deepseek-reasoner endpoints now route to V4-Flash variants and are scheduled to retire on July 24, 2026. If you are calling the old model names in production, plan the migration now rather than the week of cutover.
Cite this analysis
If you are referencing this article:
Bristot, D. (2026, April 24). DeepSeek V4 is here: the open model that made Jensen Huang's “horrible outcome” real. What LLM. https://whatllm.org/blog/deepseek-v4-preview
Sources: Official DeepSeek V4 announcement (@deepseek_ai) · DeepSeek V4-Pro and V4-Flash Hugging Face model cards and tech report · DeepSeek API change log · Dwarkesh Patel × Jensen Huang podcast (April 16, 2026) · The Information (Uber AI budget) · community pricing via OpenRouter and X posts from @sdrzn, @opinion_bits, and @ClaudeCodeCafe.