Analysis · April 2026

Meta is back

A year ago Llama 4 scored 18 on the Intelligence Index and the AI community moved on. Yesterday Muse Spark scored 52. Between those two numbers: a $15 billion bet, a new lab built from scratch, the departure of Yann LeCun, and nine months of near-total silence. Here is the full story.

By Dylan Bristot · 20 min read

The jump at a glance

  • Intelligence Index (AA v4): 18 → 52
  • HealthBench Hard: 42.8% (#1)
  • Compute efficiency gain: 10x+
  • Context window: 262,144 tokens
  • Capital deployed: $15B

Llama 4 Maverick to Muse Spark is the largest single-generation jump on the Intelligence Index by any major lab. The gap between 18 and 52 is wider than the gap between Muse Spark and the current #1.

The setup

A year ago, Meta shipped Llama 4 and watched the AI community move on. The benchmarks looked inflated. The open-source community that had championed Llama 2 and Llama 3 turned hostile. Llama 4 Maverick scored 18 on Artificial Analysis's Intelligence Index, placing it below models trained on half its budget. The bigger model, Behemoth, was delayed indefinitely and effectively became vaporware.

Yesterday, Meta announced Muse Spark. It scored 52 on the same index. For context: Claude Opus 4.6 sits at 53. GPT-5.4 at 57. Gemini 3.1 Pro at 57. Meta went from irrelevant to top-five in a single generation.

Between those two numbers sits one of the most expensive and aggressive rebuilds in tech history. A new Chief AI Officer poached from Scale AI. $15 billion in capital deployment. A standalone frontier lab that did not exist twelve months ago. The quiet departure of the man who built Meta's AI reputation over the past decade. And a complete, ground-up rebuild of everything: architecture, data, optimization, reinforcement learning, inference stack.

This is the full story of how Meta got here, what Muse Spark can and cannot do, and what it means for the frontier.

The Llama 4 disaster, honestly

Meta released the Llama 4 family in April 2025 with the usual fanfare. Scout, Maverick, and a preview of Behemoth. Mixture-of-experts architecture. Massive context windows. Open weights. The messaging positioned it as the next step in Meta's campaign to democratize frontier AI.

The reception was brutal. Independent benchmarks did not match Meta's claims. The /r/LocalLLaMA community, which had been one of Meta's strongest advocates through two previous Llama generations, turned openly critical. Words like "mid" and "underwhelming" dominated the threads. On coding benchmarks specifically, Maverick underperformed relative to its parameter count and training cost.

Then came the leaks. Internal communications surfaced describing post-training tweaks designed to inflate scores on popular evaluations. The term "benchmark gaming" entered the conversation and stuck. Whether the practices were genuinely deceptive or just aggressive optimization is still debated. But the perception damage was immediate and irreversible.

Behemoth, the model that was supposed to anchor the family as Meta's frontier play, never materialized. Multiple delays were attributed to capability shortfalls that the team could not resolve. By summer 2025, it was clear Behemoth was not coming.

The broader consequence went beyond one product cycle. Meta's AI credibility, carefully built over three Llama generations and years of open-research goodwill, evaporated in weeks. By mid-2025, the frontier conversation had moved to GPT-5, Gemini 3, and Claude Opus 4.5. Meta was not part of it.

The scorched-earth rebuild

What happened next unfolded over roughly nine months. The details are still emerging through a mix of official announcements, leaks, and the occasional on-record interview. Here is the timeline as we understand it.

Apr 2025: Llama 4 launches

Scout, Maverick, Behemoth preview. Maverick scores 18 on Intelligence Index. Community backlash begins.

May–Jun 2025: Internal reckoning

Benchmark gaming allegations surface. Behemoth delays. Talent attrition accelerates.

Jun 2025: Alexandr Wang hired

Scale AI CEO joins as Chief AI Officer. Meta invests $14–15B for 49% of Scale AI.

Jul–Aug 2025: MSL formed

Meta Superintelligence Labs created. Aggressive recruiting from OpenAI, Anthropic, DeepMind.

Late 2025: Ground-up rebuild

New pretraining stack, data curation pipeline, RL systems. "Demo, don't memo" culture.

Early 2026: Yann LeCun departs

Meta's chief AI scientist since 2013 leaves. Publicly criticizes new leadership.

Apr 8, 2026: Muse Spark announced

Scores 52 on Intelligence Index. First closed-source model from Meta. Available on meta.ai.

In June 2025, Mark Zuckerberg hired Alexandr Wang as Chief AI Officer. Wang was 28, the founder and CEO of Scale AI, and widely regarded as one of the most operationally effective leaders in AI infrastructure. Meta paid roughly $14 to $15 billion for a 49% stake in Scale AI as part of the arrangement. By any measure, it was the most expensive executive hire in tech history.

Wang's mandate was total: rebuild Meta's frontier AI capability from scratch. He created Meta Superintelligence Labs, a new division physically and organizationally separated from the existing GenAI team that had produced Llama 4. The old team continued to maintain the Llama line. MSL's job was to build something new.

The talent acquisition was aggressive. Reports described compensation packages ranging from $100 million to $300 million for key researchers poached from OpenAI, Anthropic, and Google DeepMind. Nat Friedman, the former GitHub CEO, joined. Daniel Gross came on board. The team grew fast.

Then, in early 2026, Yann LeCun left Meta. LeCun had been Meta's chief AI scientist since 2013 and the intellectual architect of much of its research identity. His departure was public and pointed. He criticized the new leadership as "inexperienced" and the strategic shift away from open research as a fundamental mistake. Coming from a Turing Award winner, the criticism carried weight. But it also confirmed what everyone already suspected: Meta's old AI era was over.

The scope of the rebuild was comprehensive. New pretraining architecture. New data curation pipeline. New RL post-training systems, including what Meta describes as reinforcement learning with "thinking time penalties" and "thought compression" techniques designed to make inference more efficient. MSL's internal culture reportedly shifted to "demo, don't memo," prioritizing working prototypes over research papers.
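Meta has not published the training objective behind its "thinking time penalties," but penalties of this kind are commonly implemented as a length-penalized reward during RL post-training. Here is a minimal sketch of that general idea; the function name, penalty coefficient, and token counts are illustrative assumptions, not disclosed values:

```python
def penalized_reward(task_reward: float,
                     thinking_tokens: int,
                     penalty_per_token: float = 1e-4) -> float:
    """Length-penalized RL reward: a correct answer earns more when it
    is reached with fewer reasoning tokens. The penalty coefficient is
    an illustrative assumption, not a value Meta has disclosed."""
    return task_reward - penalty_per_token * thinking_tokens

# Two rollouts that both solve the task (base reward 1.0): the rollout
# with the shorter reasoning chain scores higher, nudging the policy
# toward concise thinking.
short_chain = penalized_reward(1.0, thinking_tokens=2_000)    # ~0.8
long_chain = penalized_reward(1.0, thinking_tokens=10_000)    # ~0.0
assert short_chain > long_chain
```

Under a scheme like this, verbose reasoning is only worth its cost when it actually changes the outcome, which is consistent with the low output-token counts reported later in this piece.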

What Muse Spark actually is

Muse Spark is the first model out of Meta Superintelligence Labs. It is closed-source. That alone makes it a turning point for Meta's AI strategy. But the technical details matter just as much as the licensing.

Core capabilities

Input: Text, images, and speech (natively multimodal)
Context: 262,144 tokens
Reasoning: Visual chain-of-thought, tool use, multi-agent orchestration
Modes: Instant, Thinking, Contemplating (new)
Efficiency: 10x+ less compute than Llama 4 Maverick for equivalent capability

Availability (as of April 9)

Live now: meta.ai and the Meta AI app (US, expanding)
Coming: WhatsApp, Instagram, Facebook, Messenger, Ray-Ban Meta glasses
API: Private preview only. Select partners. No public pricing yet.
License: Closed-source (proprietary). No third-party hosting.

The most technically interesting feature is Contemplating mode. Most "thinking" or "reasoning" modes in current frontier models work by extending a single inference chain: the model thinks longer on one thread. Contemplating takes a different approach. It orchestrates multiple sub-agents that reason in parallel on different aspects of a hard problem, then synthesizes their outputs into a final answer.

If that sounds abstract, the practical implication is simple: better performance on hard tasks without the proportional latency increase that longer single-chain reasoning creates. Meta's benchmarks show significant jumps when Contemplating mode is enabled, particularly on Humanity's Last Exam, ARC AGI 2, CharXiv Reasoning, and Frontier Science.
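Meta has not described how Contemplating is implemented. But the orchestration pattern as described (parallel sub-agents, then a synthesis step) can be sketched roughly as follows; every name here (`solve_aspect`, `contemplate`, the aspect labels) is a hypothetical stand-in, not Meta's API:

```python
import asyncio

async def solve_aspect(problem: str, aspect: str) -> str:
    """Stand-in for one sub-agent reasoning about a single aspect of
    the problem. In a real system this would be a model call with its
    own reasoning chain; here it returns a labeled placeholder."""
    await asyncio.sleep(0)  # yields control, as a concurrent model call would
    return f"[{aspect}] partial answer for: {problem}"

async def contemplate(problem: str, aspects: list[str]) -> str:
    """Run one sub-agent per aspect concurrently, then synthesize."""
    # Sub-agents run in parallel, so wall-clock latency tracks the
    # slowest sub-agent rather than the sum of all chains (the
    # latency advantage claimed for this mode).
    partials = await asyncio.gather(
        *(solve_aspect(problem, a) for a in aspects)
    )
    # Synthesis step: in practice another model call that merges the
    # partial answers into one; here, a simple join.
    return " | ".join(partials)

answer = asyncio.run(
    contemplate("hard problem", ["algebraic", "geometric", "numeric"])
)
print(answer)
```

The design choice worth noting is that the cost of adding a sub-agent shows up in total tokens, not in end-to-end latency, which is exactly the tradeoff a length-penalized training regime would push toward.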

The headline efficiency claim: Muse Spark reaches or exceeds Llama 4 Maverick's capabilities with over an order of magnitude less compute. This is an infrastructure claim, not a benchmark number. If true, it means MSL's rebuilt training stack found something genuinely novel in how models learn. Meta attributes this to the new architecture plus RL techniques for penalizing unnecessary reasoning tokens and compressing thought chains.

One telling detail: during the Artificial Analysis evaluation, Muse Spark used approximately 58 million output tokens across the full benchmark suite. That is comparable to Gemini 3.1 Pro and dramatically lower than Claude Opus 4.6 (157 million) or GPT-5.4 (120 million). The model is not just performing well. It is performing well while being concise.

The benchmarks, without the press release

Meta published detailed benchmarks with a methodology document. Artificial Analysis ran an independent evaluation with early access. Here is where the model actually stands relative to the current frontier.

| Benchmark                  | Muse Spark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|----------------------------|------------|---------|-----------------|----------------|
| Intelligence Index (AA v4) | 52         | 57      | 53              | 57             |
| HealthBench Hard           | 42.8%      | 40.1%   | 14.8%           | n/a            |
| GPQA Diamond               | 89.5%      | 92.8%   | 92.7%           | 94.3%          |
| CharXiv Reasoning          | 86.4%*     | 82.8%   | n/a             | 80.2%          |
| ARC AGI 2                  | 42.5%*     | 76.1%   | n/a             | 76.5%          |
| SWE-Bench Verified         | 77.4%      | n/a     | n/a             | n/a            |
| MMMU-Pro                   | ~80.5%     | n/a     | n/a             | n/a            |
| HLE (w/ tools)             | ~50–58%*   | ~48%    | n/a             | n/a            |
| Frontier Science           | 38%*       | n/a     | n/a             | n/a            |

* Contemplating mode. n/a: score not reported in the sources. Data from Meta's evaluation methodology document and Artificial Analysis Intelligence Index v4.0.

The honest read: Muse Spark is a top-five model. It is not the best model. Gemini 3.1 Pro and GPT-5.4 still lead the Intelligence Index at 57. Claude Opus 4.6 edges ahead at 53. Muse Spark sits at 52.

But context transforms those numbers. Llama 4 Maverick scored 18 on the same index. The gap between 18 and 52 is wider than the gap between Muse Spark and the current leaders. Meta did not just improve. They jumped three tiers in a single generation.

Where Muse Spark genuinely leads: HealthBench Hard. At 42.8%, it outperforms GPT-5.4 (40.1%), Claude Opus 4.6 (14.8%), and everything else in the field. Meta trained it with input from over 1,000 physicians, and the results are clear. If medical AI is your domain, this is the model to watch right now.

CharXiv Reasoning (86.4% in Contemplating mode) is another standout. This benchmark measures understanding of charts, figures, and visual data. Muse Spark beats both GPT-5.4 (82.8%) and Gemini 3.1 Pro (80.2%) here. The multimodal perception story is real.

Where it falls short: ARC AGI 2. At 42.5%, Muse Spark trails Gemini 3.1 Pro (76.5%) and GPT-5.4 (76.1%) by a massive margin. This is abstract reasoning, the kind of fluid pattern recognition that many researchers consider a proxy for general intelligence. The gap here is not competitive. It is structural.

GPQA Diamond (89.5%) is strong but not leading. Gemini 3.1 Pro hits 94.3% on the same benchmark. PhD-level science questions remain a tier where the Google model dominates.

Coding is mixed. SWE-Bench Verified at 77.4% is competitive but not frontier-defining. Meta explicitly calls out "long-horizon agentic systems and coding workflows" as areas of heavy ongoing investment. Translation: they know this is a weakness and they are working on it.

The closed-source pivot

This is the part that stings for the open-source community. And it should.

Meta spent three years building goodwill as the company that believed AI should be open. Llama 2 was a turning point for the industry. Llama 3 cemented Meta's reputation as the most important contributor to open-weight AI. "Open source is the path to progress" was not just a talking point. It was Meta's entire identity in this space.

Muse Spark breaks that. Not completely. Meta says the Llama line will continue separately, and larger Muse models may eventually be open-sourced. But the flagship model, the one with the best benchmarks, the one that represents the future direction of Meta's AI, is proprietary. No download. No self-hosting. No third-party inference providers.

The open vs. closed split

What Meta says

  • Llama continues as a separate open-weight line
  • Future Muse models may be open-sourced
  • Muse Spark is "the first step" and efficiency matters more than access right now
  • Private API preview for select partners

What the community hears

  • The best model is no longer open
  • "May" and "hopes to" are not commitments
  • Meta wants API revenue, just like everyone else
  • The open-source champion became another closed lab

The strategic logic is clear enough. Meta wants API revenue. Meta wants tighter integration with its own products: Instagram, WhatsApp, Messenger, the Ray-Ban glasses. Meta wants to compete directly with OpenAI and Anthropic on commercial API services. None of that works if the model is open-weight from day one and anyone can undercut your pricing by self-hosting.

Whether you see this as pragmatic evolution or a betrayal depends on how much you valued the original open promise. For the /r/LocalLLaMA community that rallied around Llama 2 and 3, this feels like a gut punch. For Meta's shareholders, it probably looks like the company finally acting like a business.

The "personal superintelligence" bet

Meta frames Muse Spark as the first step toward what it calls "personal superintelligence." Strip away the marketing language and the strategy is straightforward. Meta has something no other AI lab has: the social graph. Three billion people use Meta's apps daily. Their follows, their interests, their conversations, the creators they engage with, the communities they belong to.

If you are going to build a personalized AI that actually knows its user, Meta's data advantage is hard to replicate. The launch demo showcased this: personalized answers drawing from your followed creators and public posts. Shopping recommendations based on actual interests, not keyword matching. Health guidance aligned with physician standards. Visual annotations for real-world troubleshooting through the Ray-Ban Meta glasses.

None of this is possible if the model runs on someone else's infrastructure with no access to your social context. The closed-source strategy and the personal superintelligence strategy are the same strategy. Meta is not building a general API product. It is building an AI layer that sits on top of the largest social platform in history. The model needs to live inside Meta's ecosystem for that to work.

That does not make the open-source community feel better. But it does explain why Meta made the decision it did. This is not a pivot born of greed. It is a pivot born of product vision. Whether you buy that vision is a separate question.

What to make of all this

Four takeaways

1. Meta is genuinely back in the frontier conversation

Not at the top. Not yet. But the jump from 18 to 52 is the most dramatic single-generation improvement any major lab has achieved on the Intelligence Index. Early analyst reactions describe it as putting Meta "back in the conversation" at the frontier level. That is accurate. A year ago they were a punchline. Today they are competitive.

2. The efficiency story matters more than the benchmark scores

If Muse Spark truly reaches Maverick-level performance with an order of magnitude less compute, the implication goes beyond one model. It means MSL's rebuilt training stack found something meaningful about how to train efficiently. Larger Muse models are already in development. If the efficiency gains compound across scale, the next Muse model could close the remaining gap to the top fast.

3. Contemplating mode is worth watching

Multi-agent orchestration for inference is not new as a research concept. But shipping it as a first-class product feature in a frontier model is. If Contemplating mode delivers the claimed gains without proportional latency cost, it changes the cost-performance tradeoffs for anyone solving hard reasoning problems. The benchmark deltas between Thinking and Contemplating mode on HLE and ARC AGI 2 are significant.

4. The gaps are real, and Meta knows it

Scoring 42.5% on ARC AGI 2 when the leaders hit 76% is not "competitive." It is a structural gap. The coding benchmarks are solid but unremarkable. Meta's own blog post calls out agentic and coding capabilities as areas of heavy investment. That honesty is refreshing, but it also means Muse Spark is not the model you choose for top-tier abstract reasoning or autonomous coding workflows today.

Where this goes next

Meta has been explicit that larger Muse models are already in development. Muse Spark is described as "the first step on our scaling ladder," which is the kind of language you use when you are confident the next step exists. Given the efficiency breakthrough, scaling up the Muse architecture should yield predictable gains. The question is how fast those gains close the remaining five-point gap to Gemini 3.1 Pro and GPT-5.4.

The API rollout will be the near-term story. Right now, Muse Spark is only available through Meta's own apps and a private preview. If Meta wants to compete commercially with OpenAI's and Anthropic's API businesses, they need public access with transparent pricing. The timing and pricing of that launch will reveal how seriously Meta is pursuing enterprise AI revenue versus just powering its own consumer products.

And the open-source question will not go away. The Llama community is watching. If Meta ships Muse 2 and it is also closed, the "we hope to open-source future versions" line will start to sound like a deflection rather than a roadmap. If they actually release a Muse model under an open license, even a smaller one, it would quiet a lot of the criticism.

Three things to watch in Q2 2026

1. Public API pricing. When Meta opens Muse Spark to developers, the price point will signal whether they see this as a premium play (competing with Claude Opus) or a volume play (undercutting the market). Private preview pricing is not yet public.
2. Muse 2 timeline and licensing. The pace of the next Muse release and whether it is open or closed will define whether Muse Spark was the exception or the new rule for Meta's AI strategy.
3. ARC AGI 2 and coding gaps. The abstract reasoning and coding weaknesses are the clearest blockers to Muse Spark competing at the very top. If the next model does not close the ARC AGI gap meaningfully, the "efficiency breakthrough" narrative loses credibility.

The bottom line

A year ago, Meta's AI division was a mess. Llama 4 was a disappointment. Behemoth was vaporware. The team was bleeding talent. The company that had positioned itself as the champion of open AI was losing the frontier race to labs a fraction of its size.

Today Meta shipped a top-five frontier model, built by a team that did not exist twelve months ago, using a training stack that did not exist nine months ago. The Intelligence Index jump from 18 to 52 is not incremental progress. It is a different tier of capability, achieved through what appears to be a genuinely novel efficiency breakthrough.

Is Muse Spark the best model in the world? No. Gemini 3.1 Pro and GPT-5.4 are still ahead. The abstract reasoning gap is real. The coding story is unfinished. The closed-source pivot will cost Meta goodwill it spent years earning.

But none of that changes the trajectory. Meta went from irrelevant to competitive in a single generation. Whatever else you think about the strategy, the execution is hard to argue with. Meta is back. The interesting part is what they do with it.

Frequently asked questions

What is Muse Spark?

Muse Spark is Meta's new frontier AI model, announced April 8, 2026. It is the first model from Meta Superintelligence Labs (MSL), a new division led by Alexandr Wang. It is natively multimodal (text, image, speech), supports tool use and multi-agent orchestration, and has a 262k token context window. It scored 52 on the Artificial Analysis Intelligence Index, up from Llama 4 Maverick's 18.

How does Muse Spark compare to GPT-5.4 and Claude Opus 4.6?

On the Intelligence Index, Muse Spark (52) sits behind GPT-5.4 (57) and Claude Opus 4.6 (53) but in the top 5 overall. It leads all models on HealthBench Hard (42.8%) and CharXiv Reasoning (86.4%). It trails significantly on ARC AGI 2 (42.5% vs. 76%+ for leaders) and GPQA Diamond (89.5% vs. 94.3% for Gemini 3.1 Pro).

What is Meta Superintelligence Labs?

MSL is Meta's restructured frontier AI division, created in mid-2025 after Llama 4's disappointing launch. Led by Alexandr Wang (former Scale AI CEO, now Meta's Chief AI Officer), it rebuilt Meta's entire training stack from scratch over nine months. Muse Spark is its first public model.

Is Muse Spark open source?

No. Muse Spark is closed-source, a departure from the open-weight Llama series. Meta says it "hopes to open-source future versions" of the Muse family, but the current release is proprietary. Access is limited to meta.ai, the Meta AI app, and a private API preview for select partners. No third-party inference providers are involved yet.

What happened to Llama 4?

Llama 4 launched in April 2025 (Scout, Maverick, Behemoth preview) as open-weight multimodal models. The reception was negative: independent benchmarks didn't match claims, benchmark gaming allegations surfaced, and Behemoth was delayed indefinitely. Maverick scored 18 on the Intelligence Index. The fallout led to Meta's AI reorganization and the creation of MSL.

What is Contemplating mode?

Contemplating is Muse Spark's approach to hard reasoning. Instead of extending a single inference chain, it orchestrates multiple sub-agents that reason in parallel on different aspects of a problem, then synthesizes their outputs. Meta claims this delivers significant gains on difficult benchmarks without proportional latency increases. Available alongside Instant and Thinking modes in the Meta AI app.

Cite this analysis

If you are referencing this analysis:

Bristot, D. (2026, April 9). Meta is back: Muse Spark, the rebuild, and what the benchmarks actually say. What LLM. https://whatllm.org/blog/meta-is-back-muse-spark

Sources: Meta AI blog, Meta Newsroom, Artificial Analysis Intelligence Index v4.0, Meta evaluation methodology document, Fortune, Axios, Wired, Ars Technica, CNBC, WhatLLM.org tracking. April 2026.
