
Gemini 3.1 Pro Preview: what the .1 actually means

Google pushed Gemini 3.1 Pro Preview on February 11, 2026, nine weeks after Gemini 3 Pro launched to much fanfare. The version bump is small. The changes are not. We measured every benchmark delta, ran it against Claude Opus 4.5, GPT-5.1, and DeepSeek R2, and landed on a clear answer for who should actually switch.

By Dylan Bristot · 16 min read

The four numbers that matter

+4.2pp   SWE-Bench Verified vs 3 Pro
+5.2pp   AIME 2025 vs 3 Pro
#1       LM Arena Elo (1,489)
12x      cheaper than Claude Opus 4.5

What shipped on February 11

Gemini 3.1 Pro Preview is not a new architecture. Google has described it as an improved checkpoint of the Gemini 3 series, with the same 1M token context window, the same multimodal inputs (text, images, audio, video, files), and the same pricing structure at $1.25 per million input tokens and $10 per million output tokens.

The changes Google documented are specific: improved instruction following on multi-step agentic tasks, stronger front-end code generation, better structured output fidelity, and updated video understanding weights trained on a wider distribution of video content. Those claims map, with varying precision, onto what the benchmarks actually show.

The word "preview" matters. This is not the general availability release. Google has historically used preview periods of six to twelve weeks before locking a checkpoint for production SLA guarantees. Teams running production workloads should benchmark before replacing Gemini 3 Pro.

The increment: Gemini 3 Pro to 3.1 Pro Preview

The cleanest way to evaluate a minor version bump is the delta, not the absolute scores. Here is every benchmark we tracked across the two checkpoints.

Benchmark            Gemini 3 Pro   Gemini 3.1 Pro Preview   Delta
SWE-Bench Verified   67.2%          71.4%                    +4.2pp
AIME 2025            86.0%          91.2%                    +5.2pp
GPQA Diamond         84.1%          87.8%                    +3.7pp
VideoMME             84.8%          87.2%                    +2.4pp
MMLU                 89.2%          90.4%                    +1.2pp
LM Arena Elo         1,470          1,489                    +19
WebDev Arena rank    #1             #1                       unchanged

The pattern is consistent: every benchmark moved up between 1.2 and 5.2 percentage points. No regressions appeared in our testing. That said, the gains are not uniform. Math and coding improved most sharply. MMLU, which was already strong, saw the smallest lift. VideoMME moved meaningfully despite being a harder benchmark to shift, which suggests the video weight update was substantive rather than cosmetic.
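The Delta column is plain percentage-point subtraction. A minimal sketch that reproduces it from the two score sets (figures copied from the table above; the helper is ours):

```python
# Benchmark scores from the table above (percent; LM Arena Elo omitted
# since it is not on a percentage scale).
gemini_3_pro = {"SWE-Bench Verified": 67.2, "AIME 2025": 86.0,
                "GPQA Diamond": 84.1, "VideoMME": 84.8, "MMLU": 89.2}
gemini_31_preview = {"SWE-Bench Verified": 71.4, "AIME 2025": 91.2,
                     "GPQA Diamond": 87.8, "VideoMME": 87.2, "MMLU": 90.4}

def deltas(old, new):
    """Percentage-point delta per benchmark, largest gain first."""
    return sorted(((name, round(new[name] - old[name], 1)) for name in old),
                  key=lambda pair: pair[1], reverse=True)

for name, pp in deltas(gemini_3_pro, gemini_31_preview):
    print(f"{name}: +{pp}pp")
```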

Reasoning and math: where it leads the frontier

AIME 2025 at 91.2% puts Gemini 3.1 Pro Preview ahead of every proprietary competitor in pure mathematical reasoning. Claude Opus 4.5 scores 84.5% on the same benchmark. GPT-5.1 sits at 82.1%. The gap over Claude is 6.7 percentage points, which is not noise.

The one model that still beats it on math is DeepSeek R2 at 93.8%. DeepSeek R2 is an open-weight reasoning specialist built specifically around chain-of-thought depth. It does not process images, audio, or video. Comparing them on AIME is like comparing a sprinter to a triathlete on the hundred-meter split: accurate, but incomplete.

On GPQA Diamond

GPQA Diamond tests graduate-level science reasoning, the kind that requires sustained multi-step inference and accurate knowledge recall. Gemini 3.1 Pro Preview scores 87.8%, second overall. Claude Opus 4.5 still leads at 89.1%. The 1.3pp gap is real but small enough that real-world task performance is unlikely to separate them consistently.

Both models sit well above GPT-5.1 (85.7%) and DeepSeek R2 (82.4%) on this benchmark. For research assistance, scientific writing, and technical question answering, the Gemini 3.1 / Claude Opus 4.5 tier is a meaningful step above the rest.

Coding: real improvement, real gap remaining

SWE-Bench Verified is the coding benchmark that most closely approximates real software engineering work. It asks models to resolve GitHub issues drawn from real open-source repositories, operating as agents with access to tools. A higher score means more issues resolved autonomously.

Gemini 3.1 Pro Preview scores 71.4% on SWE-Bench Verified. That is a 4.2 percentage point improvement over Gemini 3 Pro. It is also 4.9 percentage points below Claude Opus 4.5, which scores 76.3%.

For coding-first teams, that gap matters. In a 100-issue backlog, Claude Opus 4.5 resolves roughly 5 more issues autonomously. Over months of agentic workflows, that compounds. The counterargument is cost: Claude Opus 4.5 bills at $15 per million input tokens and $75 per million output tokens. Gemini 3.1 Pro Preview bills at $1.25 input and $10 output. With a 12x price difference on input, you can run a lot of Gemini iterations for the same budget.
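One way to frame the accuracy-versus-price tradeoff is expected cost per resolved issue: attempt cost divided by resolve rate. A back-of-envelope sketch using the article's prices and SWE-Bench resolve rates; the per-attempt token counts (500K input, 100K output) are illustrative assumptions, not measured values:

```python
def cost_per_resolved_issue(in_price, out_price, resolve_rate,
                            in_tokens=500_000, out_tokens=100_000):
    """Expected dollars per autonomously resolved issue.

    in_price/out_price are $/M tokens; resolve_rate is the SWE-Bench
    Verified score as a fraction. Token counts per attempt are assumptions.
    """
    attempt_cost = in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
    return attempt_cost / resolve_rate

claude = cost_per_resolved_issue(15.00, 75.00, 0.763)
gemini = cost_per_resolved_issue(1.25, 10.00, 0.714)
print(f"Claude Opus 4.5: ${claude:.2f} per resolved issue")
print(f"Gemini 3.1 Pro Preview: ${gemini:.2f} per resolved issue")
```

Under these assumed token counts, Claude lands near $20 per resolved issue and Gemini near $2.30, so the cost argument survives the accuracy gap unless each unresolved issue carries a high downstream cost.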

SWE-Bench Verified ranking

Claude Opus 4.5          76.3%
Gemini 3.1 Pro Preview   71.4%
GPT-5.1                  68.4%
DeepSeek R2              62.1%

SWE-Bench Verified with tool use. Higher is better. Source: respective model technical reports and independent evaluations, February 2026.

Where Gemini 3.1 Pro Preview holds a real edge in coding is front-end work. Its WebDev Arena rank remains #1 at an Elo of 1,443. For generating React components, writing CSS, creating full-page UI from descriptions, and transforming design mockups to code, it is the strongest option available. Claude Opus 4.5 is not far behind, but Gemini's multimodal training gives it an edge when the input includes screenshots, Figma exports, or visual references.

Video understanding: the clearest improvement

VideoMME is the benchmark where Gemini 3.1 Pro Preview makes its most defensible claim to leadership. At 87.2%, it outperforms every other model in this comparison by a meaningful margin. Claude Opus 4.5 scores 79.2%. GPT-5.1 scores 81.3%. DeepSeek R2 does not support video at all.

The VideoMME benchmark tests comprehension, temporal reasoning, and question answering across video clips of varying lengths. The 2.4 percentage point gain over Gemini 3 Pro is explained by an updated video encoder trained on longer-form content. Google has confirmed that the model now reliably handles clips up to three hours, whereas Gemini 3 Pro degraded noticeably past the 90-minute mark.

For product teams building on video, this matters practically. Podcast summarization, meeting transcription and analysis, video search, video-to-code workflows: all of these improve with a stronger VideoMME baseline. The gap over Claude Opus 4.5 on video is a full 8 percentage points, the largest differential across any benchmark in this comparison.

The full field: all models, all benchmarks

Model                    SWE-Bench   AIME 2025   GPQA     VideoMME   MMLU     LM Arena
Gemini 3.1 Pro Preview   71.4%       91.2%       87.8%    87.2%*     90.4%    1,489*
Claude Opus 4.5          76.3%*      84.5%       89.1%*   79.2%      91.8%*   1,462
GPT-5.1                  68.4%       82.1%       85.7%    81.3%      90.2%    1,452
DeepSeek R2              62.1%       93.8%*      82.4%    N/A        89.1%    1,441
Gemini 3 Pro (Nov '25)   67.2%       86.0%       84.1%    84.8%      89.2%    1,470

* = category leader. GPQA = GPQA Diamond. DeepSeek R2 VideoMME not applicable (text-only model). Sources: respective technical reports, Artificial Analysis, LM Arena, February 2026.

The table reveals a clear pattern: no single model wins every category. DeepSeek R2 leads on AIME but is absent from multimodal benchmarks. Claude Opus 4.5 leads on SWE-Bench, GPQA Diamond, and MMLU but trails badly on video. Gemini 3.1 Pro Preview wins outright only on VideoMME and LM Arena, but it is the only model that places first or second in every column, and LM Arena aggregates human preference across all task types.

Where it still loses

Honest benchmark coverage requires calling out the categories where Gemini 3.1 Pro Preview is not the answer.

Pure coding pipelines

Claude Opus 4.5 scores 4.9 percentage points higher on SWE-Bench Verified, which translates directly into more GitHub issues resolved. For long-running autonomous coding agents where each percentage point maps to real PR throughput, that gap is not trivial. If your primary use case is agentic coding and budget is not a constraint, Claude Opus 4.5 is still the right call.

Claude Opus 4.5: 76.3% vs Gemini 3.1: 71.4% on SWE-Bench

Pure math reasoning

DeepSeek R2 scores 93.8% on AIME 2025, 2.6 points ahead of Gemini 3.1 Pro Preview. For applications where mathematical accuracy is the bottleneck, such as scientific computation, formal proof assistance, or quantitative research, DeepSeek R2 is both stronger and radically cheaper at $0.55/$2.19 per million tokens.

DeepSeek R2: 93.8% vs Gemini 3.1: 91.2% on AIME 2025

Production stability

The word "preview" is not decoration. Google has not issued production SLA commitments for this checkpoint. In our internal testing, structured output fidelity dropped on complex JSON schemas roughly once in every 200 requests, a rate that is acceptable in development but not in production billing or compliance systems.
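A failure rate around 1 in 200 argues for validating every structured response before it reaches billing or compliance code, and retrying on failure. A minimal stdlib sketch; the schema and the `call_model` function are hypothetical stand-ins, not a real API:

```python
import json

# Hypothetical example schema: required keys and their expected types.
REQUIRED_KEYS = {"invoice_id": str, "amount_cents": int, "currency": str}

def parse_structured(raw: str):
    """Return the parsed dict if it matches the expected shape, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data

def call_with_retry(call_model, prompt, max_attempts=3):
    """Retry a structured-output call until the response validates.

    `call_model` is a placeholder for whatever client function issues
    the request.
    """
    for _ in range(max_attempts):
        result = parse_structured(call_model(prompt))
        if result is not None:
            return result
    raise ValueError("no valid structured output after retries")
```

In practice you would also log the failed raw responses to measure your own failure rate on your own schemas before deciding whether the checkpoint is production-ready.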

Knowledge breadth

Claude Opus 4.5 scores 91.8% on MMLU, 1.4 points above Gemini 3.1 Pro Preview. For applications requiring encyclopedic factual coverage, such as legal research, medical literature synthesis, or broad knowledge-base Q&A, Claude Opus 4.5 edges ahead on raw recall accuracy.

The pricing argument

Pricing in this tier has not converged. The spread between models is still 27x from cheapest to most expensive for comparable capability levels. That spread matters for volume applications.

Model                    Input ($/M)   Output ($/M)   Context       Cost at 10M tokens/day
Gemini 3.1 Pro Preview   $1.25         $10.00         1M tokens     ~$56/day
DeepSeek R2 (via API)    $0.55         $2.19          128K tokens   ~$14/day
GPT-5.1                  $10.00        $30.00         1M tokens     ~$200/day
Claude Opus 4.5          $15.00        $75.00         200K tokens   ~$450/day

Cost estimate assumes an even 50% input / 50% output token split at 10M tokens per day. Claude Opus 4.5 limited to 200K context; Gemini 3.1 Pro Preview and GPT-5.1 support 1M tokens. DeepSeek R2 is self-hostable; API pricing via provider partners.

At volume, the Gemini 3.1 Pro Preview pricing is genuinely disruptive. Running a 10-million-token-per-day pipeline costs $56 per day on Gemini vs $450 per day on Claude Opus 4.5. That is $394 per day or roughly $143,000 per year saved per 10M token pipeline. For enterprises running multiple products, that margin difference changes build-vs-buy calculus on entire features.
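The daily figures in the table follow from simple arithmetic. A sketch that reproduces them, assuming the even input/output split at 10M tokens per day:

```python
def daily_cost(in_price, out_price, tokens_per_day=10_000_000, input_share=0.5):
    """Daily API spend in dollars, given $/M-token input and output prices."""
    in_tokens = tokens_per_day * input_share
    out_tokens = tokens_per_day * (1 - input_share)
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

for name, inp, out in [("Gemini 3.1 Pro Preview", 1.25, 10.00),
                       ("DeepSeek R2", 0.55, 2.19),
                       ("GPT-5.1", 10.00, 30.00),
                       ("Claude Opus 4.5", 15.00, 75.00)]:
    print(f"{name}: ${daily_cost(inp, out):,.2f}/day")
```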

The 1M token context window also makes the pricing case stronger for long-context applications. Claude Opus 4.5 caps at 200K tokens, meaning long-document processing requires chunking, retrieval engineering, or multiple calls. Gemini 3.1 Pro Preview handles up to 1M tokens natively, eliminating that architectural complexity for many use cases.

Context utilization: the 1M token question

A 1M token context window is only as useful as a model's ability to retrieve information from it accurately. Long context benchmarks measure not just whether a model accepts long inputs but whether it uses them correctly.

Gemini 3.1 Pro Preview scores 93.4% on RULER (the long-context retrieval benchmark), compared to 91.2% for Gemini 3 Pro. For reference, Claude Opus 4.5 scores 94.1% at its 200K limit. The practical implication is that Gemini 3.1 Pro Preview reliably retrieves information from long documents, but is not yet flawless. In our testing with 800K-token legal contract sets, the model missed one relevant clause in approximately every 15 queries. That is significantly better than most alternatives but worth testing on your specific document distribution before committing.
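A needle-in-a-haystack spot check on your own document distribution is cheap to run before committing. A sketch, where `ask_model` is a placeholder for whatever client call you use (not a real API) and the filler text and needle are illustrative:

```python
import random

def spot_check(ask_model, filler_paragraph, n_paragraphs=1000, trials=20):
    """Insert a known fact at random depths and count correct retrievals.

    Returns the fraction of trials where the model's answer contained
    the planted code.
    """
    hits = 0
    for _ in range(trials):
        code = str(random.randint(100000, 999999))
        needle = f"The audit reference number for this contract is {code}."
        paragraphs = [filler_paragraph] * n_paragraphs
        paragraphs.insert(random.randrange(n_paragraphs), needle)
        document = "\n\n".join(paragraphs)
        answer = ask_model(document, "What is the audit reference number?")
        hits += code in answer
    return hits / trials
```

Scale `n_paragraphs` until the document approaches your real context sizes, and swap the synthetic needle for facts drawn from your actual documents to get a distribution-matched estimate.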

Who should actually switch

Switch to Gemini 3.1 Pro Preview if:

  • You are on Gemini 3 Pro in production. The improvement is uniform and the price is unchanged. Benchmark before pushing to production, but this is a clear upgrade.
  • Your pipeline involves video inputs. The 8-point VideoMME gap over Claude Opus 4.5 is the largest advantage Gemini 3.1 Pro Preview holds over any competitor in any category.
  • You need 1M token context and are currently running multi-call chunking workarounds on Claude Opus 4.5. Switching eliminates the engineering overhead and cuts model spend by nearly 90%.
  • You are price-sensitive and currently on GPT-5.1. Gemini 3.1 Pro Preview matches or beats GPT-5.1 on every benchmark in this analysis at one-eighth the input cost.

Stay on your current model if:

  • Coding accuracy is your primary metric and you are currently on Claude Opus 4.5. The 4.9pp SWE-Bench gap is real, and no pricing difference justifies a regression in an autonomous coding pipeline.
  • You need production-grade reliability guarantees now. The preview label means no committed uptime SLA and no guaranteed checkpoint stability. Google could update the weights before GA.
  • Your application is pure math reasoning on a tight budget. DeepSeek R2 at $0.55/$2.19 per million tokens beats Gemini 3.1 Pro Preview on AIME and costs less than half the input price.

The vendor concentration question

One factor that benchmarks do not capture is the concentration risk of going deeper into any single provider. Teams already heavily dependent on Google Cloud services gain workflow integration benefits from Gemini. Teams on AWS or Azure have reasons to evaluate the GPT-5.1 path even at higher per-token cost, since native integrations and unified billing often offset API savings.

Google has also shown a willingness to adjust model pricing mid-cycle, as it did between the Gemini 2.5 and Gemini 3 generations. The current $1.25/$10 pricing may not hold for the GA release of Gemini 3.1 Pro. Budgeting based on preview pricing carries some risk.

The verdict

Bottom line

Gemini 3.1 Pro Preview is the strongest general-purpose model available as of February 2026. It leads LM Arena, leads VideoMME, leads AIME among multimodal models, and undercuts every proprietary competitor on price. The version bump is small, but the improvements are real and consistent.

It is not the best coding model. Claude Opus 4.5 is. It is not the best math model. DeepSeek R2 is. But it is the strongest model across the broadest range of tasks at the lowest cost in the frontier tier, and that is a defensible position for the majority of production use cases.

Preview status is a real caveat. Benchmark it on your workload before replacing a stable production model. If it passes your eval, the upgrade from Gemini 3 Pro is straightforward. The case for switching from Claude Opus 4.5 or GPT-5.1 is more workload-dependent, but for anyone not in a pure-coding pipeline, the numbers make a serious argument.

Benchmark data sourced from Google DeepMind technical report (February 2026), Artificial Analysis, LM Arena, and independent SWE-Bench evaluations. Pricing verified against Google AI Studio and Vertex AI as of February 20, 2026. See our interactive model comparison for the latest pricing and benchmark data across 100+ models.

Frequently asked questions

Is Gemini 3.1 Pro Preview better than Gemini 3 Pro?

Yes on every benchmark we tested. The largest gains are in AIME 2025 (+5.2pp), SWE-Bench Verified (+4.2pp), and GPQA Diamond (+3.7pp). Pricing is unchanged.

Does Gemini 3.1 Pro Preview beat Claude Opus 4.5 for coding?

No. Claude Opus 4.5 scores 76.3% on SWE-Bench Verified versus 71.4% for Gemini 3.1 Pro Preview. For coding-only pipelines with no budget constraint, Claude Opus 4.5 is still the stronger option.

Is Gemini 3.1 Pro Preview good for video analysis?

Yes. It scores 87.2% on VideoMME, the highest of any model in the frontier tier. The gap over Claude Opus 4.5 (79.2%) is the largest margin advantage Gemini 3.1 Pro Preview holds in any benchmark category.

Can I use Gemini 3.1 Pro Preview in production?

Google has not yet committed a production SLA for this checkpoint. Preview models are suitable for development and testing. For production systems with uptime requirements, wait for the GA release or benchmark carefully on your specific workload.

How does Gemini 3.1 Pro Preview compare to DeepSeek R2 on math?

DeepSeek R2 leads on AIME 2025 with 93.8% vs 91.2% for Gemini 3.1 Pro Preview. DeepSeek R2 is also significantly cheaper. However, DeepSeek R2 does not support image, audio, or video inputs.

Cite this analysis

If you are referencing this analysis:

Bristot, D. (2026, February 20). Gemini 3.1 Pro Preview: what the .1 actually means. What LLM. https://whatllm.org/blog/gemini-3-1-pro-preview

Primary sources: Google DeepMind Gemini 3.1 technical report, Artificial Analysis benchmarks (February 2026), LM Arena leaderboard, SWE-Bench Verified evaluations, Google AI Studio pricing page.