Which AI actually sees best? We rank 46 multimodal models using MMMU Pro (academic visual reasoning) and LM Arena Vision (human preference).
Vision Score = MMMU Pro (60%) + LM Arena Vision (40%)
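For readers who want to reproduce the number, here is a minimal sketch of the composite, assuming Arena Vision Elo is linearly rescaled onto a 0-100 band before weighting (the exact normalization isn't stated; the 1000-1500 bounds below are an assumption that happens to reproduce the published scores for rows that have both metrics) and that models without an Arena Vision entry simply keep their MMMU Pro score:

```python
from typing import Optional

def normalize_elo(elo: float, lo: float = 1000.0, hi: float = 1500.0) -> float:
    """Map an Arena Vision Elo onto a 0-100 scale (assumed linear rescale)."""
    return max(0.0, min(100.0, 100.0 * (elo - lo) / (hi - lo)))

def vision_score(mmmu_pro_pct: float, arena_elo: Optional[float]) -> float:
    """0.6 * MMMU Pro + 0.4 * normalized Arena Vision; models without an
    Arena Vision entry keep their MMMU Pro score as-is."""
    if arena_elo is None:
        return round(mmmu_pro_pct, 1)
    return round(0.6 * mmmu_pro_pct + 0.4 * normalize_elo(arena_elo), 1)

print(vision_score(80, 1309))  # Gemini 3 Pro Preview row -> 72.7 under these assumptions
print(vision_score(79, None))  # Arena-less row -> 79.0 (MMMU Pro only)
```

That fallback is why several rows in the table below show a Vision Score identical to their MMMU Pro percentage.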
| Rank | Model | Organization | Vision Score | MMMU Pro | Arena Vision | Text Arena | License |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Flash (secondary row) | Google | 79.0 | 79% | N/A | N/A | Proprietary |
| 2 | GPT-5.2 (medium) | OpenAI | 75.0 | 75% | N/A | 1443 | Proprietary |
| 3 | Claude Opus 4.5 (high) | Anthropic | 74.0 | 74% | N/A | 1470 | Proprietary |
| 4 | GPT-5.1 Codex (high) | OpenAI | 73.0 | 73% | N/A | N/A | Proprietary |
| 5 | Gemini 3 Pro Preview (high) | Google | 72.7 | 80% | 1309 | 1490 | Proprietary |
| 6 | Doubao-Seed-1.8 | ByteDance Seed | 71.0 | 71% | N/A | N/A | Proprietary |
| 7 | Claude Opus 4.5 (legacy row) | Anthropic | 71.0 | 71% | N/A | N/A | Proprietary |
| 8 | Gemini 3 Flash | Google | 70.7 | 80% | 1284 | 1480 | Proprietary |
| 9 | GPT-5 mini (high) | OpenAI | 70.0 | 70% | N/A | N/A | Proprietary |
| 10 | Claude 4.5 Sonnet | Anthropic | 69.0 | 69% | N/A | 1446 | Proprietary |
Need to analyze charts, read documents, or process invoices? These models excel at structured visual content.
Building a chatbot that discusses images? LM Arena Vision measures how humans rate image conversations.
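As a concrete starting point, here is a minimal sketch of sending an image to a vision-capable chat model through the OpenAI Python SDK. The model name and image URL are placeholders; the same pattern covers invoice extraction, chart questions, or conversational image chat:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-5-mini",  # placeholder; swap in the vision model you are evaluating
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List the line items and the total on this invoice."},
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```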
Gemini 3 Flash leads with a vision score of 79.0 on the strength of a 79% MMMU Pro result; no LM Arena Vision Elo is recorded for that entry, so its ranking rests on academic visual reasoning rather than measured conversational preference.
Based on current benchmarks, Gemini 3 Pro has the edge for pure vision tasks, particularly MMMU Pro, where it scores 80%. GPT-5 variants also post respectable MMMU Pro scores (66%). For combined text+vision reasoning, where you need both strong language and image understanding, test both on your specific use case.
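A few points of leaderboard delta rarely settle that question, so a quick side-by-side spot check on your own prompts and images is usually more telling. A minimal sketch, assuming a provider-agnostic `ask(model_id, prompt, image_url)` helper (hypothetical, e.g. a thin wrapper around the snippet above):

```python
from typing import Callable, Iterable, Tuple

def compare_models(
    ask: Callable[[str, str, str], str],   # (model_id, prompt, image_url) -> reply text
    model_ids: Iterable[str],
    cases: Iterable[Tuple[str, str]],      # (prompt, image_url) pairs from your own workload
) -> None:
    """Print each model's answer side by side for quick manual review."""
    for prompt, image_url in cases:
        print(f"\n=== {prompt} | {image_url}")
        for model_id in model_ids:
            reply = ask(model_id, prompt, image_url)
            print(f"[{model_id}] {reply[:200]}")  # truncate long answers for eyeballing

# compare_models(ask, ["gemini-3-pro-preview", "gpt-5.2"], my_cases)  # model IDs are placeholders
```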
Among models with published vision benchmarks, Qwen3 VL 235B A22B offers the best self-hostable option with MMMU Pro 69%. However, proprietary models still lead significantly. For budget-conscious vision tasks, consider hybrid approaches: use open source for initial processing and proprietary APIs for complex reasoning.
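One way to wire up that hybrid: run each page through the self-hosted model first and escalate only low-confidence results to a proprietary API. The sketch below is illustrative only; the `local_ask`/`cloud_ask` callables and the confidence signal are assumptions, not any vendor's actual interface:

```python
from typing import Callable, Tuple

def process_page(
    image_url: str,
    local_ask: Callable[[str], Tuple[str, float]],  # self-hosted model: returns (answer, confidence)
    cloud_ask: Callable[[str, str], str],           # proprietary API: takes (image_url, draft_context)
    threshold: float = 0.8,
) -> str:
    """First pass on the cheap self-hosted model; escalate only uncertain pages."""
    draft, confidence = local_ask(image_url)
    if confidence >= threshold:
        return draft                      # keep the local answer, no API cost
    return cloud_ask(image_url, draft)    # escalate, passing the draft as context
```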
MMMU Pro (Massive Multi-discipline Multimodal Understanding) tests visual reasoning across 30+ subjects including science, engineering, art, and business. LM Arena Vision uses human voting to rank models on real image conversations. We weight MMMU Pro at 60% (measures raw capability) and Arena Vision at 40% (measures practical preference) for our composite score.
Use our benchmark comparison tool to see MMMU Pro, Arena Vision, and other scores across all models.
- Best value per dollar
- LiveCodeBench leaders
- AIME 2025 rankings
- Browse all categories
Data sources: MMMU Pro from mmmu-benchmark.github.io. LM Arena Vision from lmarena.ai. Updated weekly. See methodology.