Neura Intelligence Index

AI Model Benchmarks

Q: What is the Neura Intelligence Index?

The Neura Intelligence Index is a composite score (0-100) that aggregates AI model performance across 48+ benchmarks spanning coding, math, reasoning, general knowledge, language, multimodal, safety, and agentic tasks. It uses min-max normalization and weighted category averaging to produce a single comparable score.

Q: How often are benchmark scores updated?

Benchmark scores are synced daily via an automated pipeline that aggregates data from official model cards, Papers With Code, HuggingFace Open LLM Leaderboard, LiveBench, and LMSYS Chatbot Arena.

Q: What does the confidence level mean?

Confidence reflects how many benchmarks a model has been tested on relative to the total. High means >70% benchmark coverage, medium is 40-70%, and low is <40%. Models with low coverage may have composite scores that shift as more benchmarks are added.
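The thresholds above can be sketched as a small function. This is an illustrative sketch, not the site's actual code: the function name `confidence_level` and the 48-benchmark total in the example are assumptions, and so is the choice to treat exactly 40% and exactly 70% coverage as medium.

```python
def confidence_level(benchmarks_tested: int, benchmarks_total: int) -> str:
    """Return "high", "medium", or "low" from benchmark coverage.

    Thresholds follow the text: high is above 70%, medium is 40-70%,
    low is below 40%. Treating exactly 40% and exactly 70% as medium
    is an assumption.
    """
    if benchmarks_total <= 0 or not 0 <= benchmarks_tested <= benchmarks_total:
        raise ValueError("need 0 <= benchmarks_tested <= benchmarks_total, total > 0")
    # Integer comparisons avoid floating-point edge cases at the boundaries.
    if benchmarks_tested * 100 > 70 * benchmarks_total:
        return "high"
    if benchmarks_tested * 100 >= 40 * benchmarks_total:
        return "medium"
    return "low"


# Hypothetical example with a 48-benchmark suite.
for tested in (40, 24, 10):
    print(tested, confidence_level(tested, 48))
```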

Q: How are category weights determined?

Category weights reflect real-world demand: Coding and Reasoning each get 20%, General and Math each get 15%, Language and Multimodal each get 10%, and Safety and Agent each get 5%. Weights are renormalized across available categories so models are not penalized for missing data.

Q: Can I compare specific models side by side?

Yes. Click any model to see its full benchmark profile, or use the comparison feature to compare up to 4 models side by side with radar charts and benchmark-by-benchmark scoring.

Compare AI model performance across 48+ benchmarks. Our composite index aggregates coding, math, reasoning, and language scores into a single intelligence ranking.

Models Ranked: 356

Last Updated: Jun 11, 2026

Highlights

At-a-glance rankings across the three dimensions that matter most.

Intelligence

Neura Intelligence Index; Higher is better

Context Window

Max input tokens; Higher is better

Price

USD per 1M output tokens; Lower is better

Leaderboard

Ranked by the Neura Intelligence Index — a weighted composite of 48 benchmarks across 8 categories.

Intelligence Tradeoffs

Find the sweet spot — models in the top-left quadrant offer the best value.

Intelligence vs. Output Cost ($/1M tokens)

Providers: Anthropic, OpenAI, Alibaba Cloud / Qwen Team, Moonshot AI, MiniMax, DeepSeek, Zhipu AI, Google, Mistral, Xiaomi
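One way to make the "top-left quadrant" idea concrete is a Pareto frontier: keep every model for which no other model is both at least as cheap and at least as intelligent. This is a swapped-in technique for illustration, not something the page says it uses. The function name `pareto_frontier` is hypothetical, and the sample values are index scores and output prices copied from a few leaderboard rows.

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return names of models not dominated on (index, price).

    models maps name -> (index, output price in USD per 1M tokens).
    A model is dominated if another model has index >= and price <=,
    with at least one of the two strictly better.
    """
    frontier = []
    for name, (index, price) in models.items():
        dominated = any(
            o_index >= index and o_price <= price
            and (o_index > index or o_price < price)
            for other, (o_index, o_price) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Cheapest first, so the list reads left to right along the cost axis.
    return sorted(frontier, key=lambda n: models[n][1])


sample = {
    "Claude Fable 5": (98.2, 50.0),
    "Claude Opus 4.8": (93.5, 25.0),
    "Claude Opus 4": (75.8, 75.0),
    "o1": (76.1, 60.0),
    "Qwen3.7 Max": (80.4, 4.0),
    "Step-3.5-Flash": (74.2, 0.40),
    "DeepSeek-V4-Flash-Max": (62.0, 0.28),
}
print(pareto_frontier(sample))
```

In this sample, Claude Opus 4 and o1 drop out because Claude Fable 5 scores higher at a lower output price.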

Category Leaders

Top performers in each benchmark category.

Top Models per Task

Who leads on each individual benchmark — click any card to see full results.

Best in GPQA

GPQA Diamond (%)

Best in SWE-bench

SWE-bench Verified (%)

Best in MATH

MATH (%)

Best in MMLU-Pro

MMLU-Pro (%)

Best in HumanEval

HumanEval (%)

Best in Chatbot Arena Elo

Chatbot Arena Elo (Elo)

Best Model For Your Use Case

Different tasks need different strengths. These indices re-weight our benchmarks for specific workflows.

Coding & Engineering

Best for software development, code generation, and debugging

1. Claude Fable 5: 98.2
2. Claude Mythos Preview: 94.5
3. Claude Opus 4.8: 93.7
4. GPT-5.5 Pro: 93.1
5. Claude Opus 4.7: 88.0

Research & Analysis

Best for scientific research, data analysis, and complex reasoning

1. Claude Fable 5: 98.3
2. Claude Opus 4.8: 94.0
3. GPT-5.5 Pro: 92.2
4. Claude Mythos Preview: 90.1
5. GPT-5.4: 89.7

Content & Copywriting

Best for writing, editing, summarization, and creative tasks

1. Claude Fable 5: 98.3
2. GPT-5.5 Pro: 95.5
3. GPT-5.5: 94.1
4. Claude Opus 4.8: 94.0
5. Claude Opus 4: 90.1

Best Value

Highest intelligence per dollar — the cost-efficiency sweet spot

1. DeepSeek-V4-Flash-Max: 221.6
2. Step-3.5-Flash: 185.5
3. Gemma 4 31B: 173.2
4. Gemma 4 26B-A4B: 149.9
5. Grok 4 Fast: 139.4
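The Best Value figures appear consistent with dividing a model's index by its output price as shown in the leaderboard rows, for example 74.2 / 0.40 = 185.5 for Step-3.5-Flash. The page does not state the formula, so the sketch below is an inference from the displayed numbers, which match to within rounding. The function name `value_score` is hypothetical.

```python
def value_score(index: float, output_price_usd_per_1m: float) -> float:
    """Intelligence per dollar: index divided by output price (USD per 1M tokens).

    Assumed formula, inferred from the displayed figures.
    """
    if output_price_usd_per_1m <= 0:
        raise ValueError("price must be positive")
    return index / output_price_usd_per_1m


# (index, output price) pairs copied from leaderboard rows.
models = {
    "Step-3.5-Flash": (74.2, 0.40),
    "Gemma 4 31B": (69.3, 0.40),
    "Grok 4 Fast": (69.7, 0.50),
    "Claude Opus 4.8": (93.5, 25.0),
}

# Highest intelligence per dollar first.
ranked = sorted(models, key=lambda m: value_score(*models[m]), reverse=True)
for name in ranked:
    print(name, round(value_score(*models[name]), 1))
```

A high-index model with a high price, such as Claude Opus 4.8 at $25, lands far down this ranking even though it is near the top of the main leaderboard.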

Methodology

The Neura Intelligence Index is a composite score (0-100) computed from 48+ individual benchmarks spanning 8 categories: coding, math, reasoning, general knowledge, language, multimodal, safety, and agentic tasks.

How it works

1. Normalize: each benchmark score is min-max normalized to 0-1 across all models.
2. Categorize: normalized scores are averaged within their benchmark category.
3. Weight: category averages are combined as a weighted average, with Coding (20%), Reasoning (20%), General (15%), Math (15%), Language (10%), Multimodal (10%), Safety (5%), Agent (5%).
4. Scale: the weighted average is scaled to 0-100.
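The four steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the site's pipeline: the function names, the input layout, the model names, the raw scores, and the benchmark-to-category mapping are all hypothetical. It also assumes that weights are renormalized over the categories a model actually has, and that a benchmark on which every model ties normalizes to 0.5.

```python
from statistics import mean

CATEGORY_WEIGHTS = {
    "coding": 0.20, "reasoning": 0.20, "general": 0.15, "math": 0.15,
    "language": 0.10, "multimodal": 0.10, "safety": 0.05, "agent": 0.05,
}


def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Step 1: rescale one benchmark's scores to 0-1 across models."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: with no spread, every model gets a neutral 0.5.
        return {model: 0.5 for model in scores}
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}


def composite_index(raw, benchmark_category, weights=CATEGORY_WEIGHTS):
    """raw: {benchmark: {model: score}}. Returns {model: index in 0-100}."""
    normalized = {bench: min_max_normalize(s) for bench, s in raw.items()}
    models = {m for scores in raw.values() for m in scores}
    result = {}
    for model in models:
        # Step 2: average normalized scores within each category.
        per_category = {}
        for bench, scores in normalized.items():
            if model in scores:
                per_category.setdefault(benchmark_category[bench], []).append(scores[model])
        category_avg = {c: mean(v) for c, v in per_category.items()}
        # Step 3: weighted average, renormalized over available categories.
        weight_sum = sum(weights[c] for c in category_avg)
        weighted = sum(weights[c] * a for c, a in category_avg.items()) / weight_sum
        # Step 4: scale to 0-100.
        result[model] = 100.0 * weighted
    return result


# Hypothetical raw scores; the category assignments are assumptions.
raw_scores = {
    "HumanEval": {"model-a": 90.0, "model-b": 70.0, "model-c": 50.0},
    "SWE-bench Verified": {"model-a": 60.0, "model-b": 40.0},
    "MATH": {"model-a": 80.0, "model-c": 100.0},
}
benchmark_category = {
    "HumanEval": "coding", "SWE-bench Verified": "coding", "MATH": "math",
}
index = composite_index(raw_scores, benchmark_category)
for model in sorted(index):
    print(model, round(index[model], 1))
```

Here model-b has only coding scores, so its index is its coding average alone. That illustrates why low-coverage scores can shift as more benchmarks are added.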

Coverage & confidence

Not all models have scores on all benchmarks. The confidence indicator reflects benchmark coverage: high (>70% of benchmarks), medium (40-70%), or low (<40%). Weights are renormalized across available categories so models aren't penalized for missing data.

Data sources

Scores are aggregated from official model cards, Papers With Code, HuggingFace Open LLM Leaderboard, LiveBench, and LMSYS Chatbot Arena. Each score includes a verification status (official, self-reported, or aggregated).


Intelligence vs. Context Window

#	Model	Index							Output price (USD/1M tokens)			Coverage
1	Claude Fable 5 [Anthropic, 🇺🇸, 1w]	98.2	98.1	—	98.3	—	—	—	$50	—	—	low
2	Claude Opus 4.8 [Anthropic, 🇺🇸, 3w]	93.5	93.5	—	94.0	—	—	93.6	$25	—	—	high
3	Claude Mythos Preview [Anthropic, 🇺🇸]	93	97.8	—	95.6	83.5	—	93.9	—	—	—	medium
4	GPT-5.5 Pro [OpenAI, 🇺🇸, 1mo]	92.4	—	88.3	95.5	—	—	—	—	—	—	low
5	Claude Opus 4.7 [Anthropic, 🇺🇸, 2mo]	87.6	87.0	—	93.2	80.3	—	91.8	$25	—	—	high
6	Grok-4 Heavy [xAI, 🇺🇸]	86.9	—	84.0	89.0	—	—	—	—	—	—	low
7	GPT-5.1 High [OpenAI, 🇺🇸, 7mo]	85.6	—	83.6	87.2	—	—	—	—	—	—	low
8	GLM-5V-Turbo [Zhipu AI, 🇨🇳, 2mo]	83.9	—	—	—	—	—	—	—	—	—	low
9	Claude Opus 4.6 [Anthropic, 🇺🇸, 4mo]	83.4	82.9	83.8	88.3	87.5	—	73.2	$25	—	—	high
10	GPT-5 High [OpenAI, 🇺🇸, 10mo]	82.3	—	77.0	86.3	—	—	—	—	—	—	low
11	ChatGPT-4o Latest [OpenAI, 🇺🇸, 2y]	82.3	—	—	82.3	—	—	—	—	—	—	low
12	GPT-5.1 Medium [OpenAI, 🇺🇸, 7mo]	82.1	—	82.1	—	—	—	—	$10	—	—	low
13	Seed 2.0 Pro [ByteDance, 🇨🇳, 4mo]	81.6	75.6	82.0	88	—	—	—	—	—	—	medium
14	GPT-5.5 [OpenAI, 🇺🇸, 1mo]	81.3	55.1	80.7	92.6	95.1	—	89.1	$30	—	—	high
15	Qwen3.7 Max [Alibaba Cloud / Qwen Team, 🇨🇳, 1mo]	80.4	79.4	—	84.5	76.7	—	—	$4	—	—	medium
16	GPT-5.1 Codex High [OpenAI, 🇺🇸, 7mo]	80.0	—	80.0	—	—	—	—	—	—	—	low
17	Claude Sonnet 4.5 [Anthropic, 🇺🇸, 8mo]	79.8	92.8	64.8	81.5	72.8	—	—	$15	—	—	medium
18	GPT-5.1 [OpenAI, 🇺🇸, 7mo]	79.2	75.3	67	87.2	—	—	89.7	$10	—	—	medium
19	GPT-5.1 Instant [OpenAI, 🇺🇸, 7mo]	79.2	75.3	67	87.2	—	—	89.7	$10	—	—	medium
20	GPT-5.1 Thinking [OpenAI, 🇺🇸, 7mo]	79.2	75.3	67	87.2	—	—	89.7	—	—	—	medium
21	Grok-3 [xAI, 🇺🇸, 1y]	79.1	—	75.1	83.1	—	—	77.1	$15	—	—	low
22	GPT-5 Medium [OpenAI, 🇺🇸, 10mo]	79.0	—	68.1	87.2	—	—	—	—	—	—	low
23	GPT-5.2 Pro [OpenAI, 🇺🇸, 6mo]	78.8	—	84.0	74.7	—	—	—	—	—	—	medium
24	ERNIE 5.0 [Baidu, 🇨🇳, 5mo]	78.5	—	64.8	78.5	92.2	—	—	—	—	—	medium
25	MiMo-V2-Pro [Xiaomi, 🇨🇳, 3mo]	78.3	78.3	—	—	—	—	—	—	—	—	low
26	Claude Sonnet 4.6 [Anthropic, 🇺🇸, 4mo]	77.9	81.0	—	82.4	73.4	—	75.0	$15	—	—	high
27	GPT-5.2 [OpenAI, 🇺🇸, 6mo]	77.6	81.6	86.7	72.3	74.4	—	85.7	$14	—	—	high
28	Grok-3 Mini [xAI, 🇺🇸, 1y]	77.5	—	71.2	82.3	—	—	—	—	—	—	low
29	Kimi K2-Thinking-0905 [Moonshot AI, 🇨🇳, 9mo, OSS]	77.1	69.5	84.0	86.9	—	—	—	—	—	—	medium
30	Kimi K2.6 [Moonshot AI, 🇨🇳, 2mo, OSS]	77.0	74.5	—	78.8	—	—	85.3	$4	—	—	high
31	GPT-5.4 [OpenAI, 🇺🇸, 3mo]	76.7	50.0	96.4	84.2	—	—	86.1	$15	—	—	high
32	Seed 2.0 Lite [ByteDance, 🇨🇳, 4mo]	76.2	69.7	74.7	83.7	—	—	—	—	—	—	low
33	o1 [OpenAI, 🇺🇸, 1y]	76.1	49.5	94.2	81.2	86.7	—	—	$60	—	—	high
34	Claude Opus 4 [Anthropic, 🇺🇸, 1y]	75.8	59.2	75.5	81.3	96.0	—	68.3	$75	—	—	high
35	Gemini 3.1 Pro [Google, 🇺🇸, 4mo]	75.7	70.4	—	91.0	54.9	—	84.9	$15	—	—	high
36	DeepSeek-V4-Pro-Max [DeepSeek, 🇨🇳, 1mo, OSS]	75.5	59.9	—	88.5	77.7	—	—	$3	—	—	high
37	GLM-5 [Zhipu AI, 🇨🇳, 4mo, OSS]	75.4	78.0	—	—	—	—	—	$3	—	—	low
38	o1-pro [OpenAI, 🇺🇸, 1y]	74.9	—	—	74.9	—	—	—	—	—	—	low
39	Step-3.5-Flash [StepFun, 🇨🇳, 4mo, OSS]	74.2	71.6	80.7	—	—	—	—	$0.40	—	—	low
40	Kimi K2.5 [Moonshot AI, 🇨🇳, 4mo, OSS]	74.0	55.4	79.1	88.3	—	—	74.5	—	—	—	high
41	Gemini 3 Pro [Google, 🇺🇸, 7mo]	74.0	75.1	84.0	68.3	66.0	—	80.1	—	—	—	high
42	GPT OSS 20B High [OpenAI, 🇺🇸, 10mo, OSS]	73.4	—	82.5	66.6	—	—	—	—	—	—	low
43	Claude Opus 4.5 [Anthropic, 🇺🇸, 6mo]	73.1	83.0	—	62.2	78.2	—	—	—	—	—	medium
44	GPT-5.5 Instant [OpenAI, 🇺🇸, 1mo]	72.5	—	54.1	84.3	—	—	76.5	$30	—	—	medium
45	MiMo-V2-Omni [Xiaomi, 🇨🇳, 3mo]	72.4	72.4	—	—	—	—	—	—	—	—	low
46	Qwen3.5-122B-A10B [Alibaba Cloud / Qwen Team, 🇨🇳, 3mo, OSS]	72.2	66.6	—	86.2	64.1	—	76.4	$3	—	—	high
47	Gemini 3 Flash [Google, 🇺🇸, 6mo]	72.1	78.3	83.7	67.8	63.2	—	76.8	$3	—	—	high
48	Claude Code [Anthropic, 🇺🇸, 1y]	71.9	68.0	—	—	—	—	—	—	—	—	low
49	GPT-5 Codex [OpenAI, 🇺🇸, 9mo]	71.8	71.8	—	—	—	—	—	—	—	—	low
50	MiniMax M3 [MiniMax, 🇨🇳, 2w, OSS]	71.7	69.9	—	—	—	—	80.4	$2	—	—	medium
51	Qwen3.5-27B [Alibaba Cloud / Qwen Team, 🇨🇳, 3mo, OSS]	71.3	67.4	—	86.2	61.0	—	76.0	$2	—	—	high
52	Claude Opus 4.1 [Anthropic, 🇺🇸, 10mo]	71.2	75.6	48.1	77.9	74.1	—	—	—	—	—	medium
53	Step3-VL-10B [StepFun, 🇨🇳, 5mo, OSS]	70.5	—	66	—	—	—	77.3	—	—	—	low
54	GPT-5.1 Codex [OpenAI, 🇺🇸, 7mo]	70.1	70.1	—	—	—	—	—	—	—	—	low
55	GLM-5.1 [Zhipu AI, 🇨🇳, 2mo, OSS]	70.0	54.0	—	88.5	—	—	—	$4	—	—	medium
56	Grok 4 Fast [xAI, 🇺🇸, 9mo]	69.7	—	73.1	57.8	98.6	—	—	$0.50	—	—	medium
57	DeepSeek-V3.2 [DeepSeek, 🇨🇳, 6mo, OSS]	69.7	68.9	74.8	78.4	—	—	—	—	—	—	medium
58	Qwen3.5-397B-A17B [Alibaba Cloud / Qwen Team, 🇨🇳, 4mo, OSS]	69.5	75.4	—	69.1	70.7	—	—	$4	—	—	medium
59	Kimi K2 0905 [Moonshot AI, 🇨🇳, 9mo]	69.5	—	—	69.5	—	—	—	—	—	—	low
60	Claude Sonnet 4 [Anthropic, 🇺🇸, 1y]	69.3	55.1	66.8	72.6	90.0	—	64.0	$15	—	—	high
61	Gemma 4 31B [Google, 🇺🇸, 2mo, OSS]	69.3	—	—	64.1	70.3	—	77.9	$0.40	—	—	medium
62	GPT OSS 120B High [OpenAI, 🇺🇸, 10mo, OSS]	69.1	—	73.9	77.9	52.6	—	—	$0.50	—	—	low
63	MAI-Thinking-1 [Microsoft, 🇺🇸, 2w]	69.0	47.0	80.3	82.5	—	—	—	—	—	—	medium
64	GLM-4.7 [Zhipu AI, 🇨🇳, 6mo, OSS]	68.9	57.4	78.6	82.3	—	—	—	—	—	—	medium
65	Qwen3.6 Plus [Alibaba Cloud / Qwen Team, 🇨🇳, 2mo]	68.5	61.7	—	70.2	74.1	—	79.2	$3	—	—	high
66	Nova 2 Pro [Amazon, 🇺🇸, 6mo]	68.4	67.8	73.6	78.7	—	—	41.4	—	—	—	medium
67	Muse Spark [Meta, 🇺🇸, 2mo]	68.3	49.9	—	76.9	—	—	88.2	—	—	—	high
68	Gemini 2.0 Flash Thinking [Google, 🇺🇸, 1y]	68.2	—	—	66.6	—	—	71.3	—	—	—	low
69	Qwen3.5-35B-A3B [Alibaba Cloud / Qwen Team, 🇨🇳, 3mo, OSS]	67.9	60.3	—	84.7	58.2	—	73.5	$2	—	—	high
70	Qwen3-Next-80B-A3B-Thinking [Alibaba Cloud / Qwen Team, 🇨🇳, 9mo, OSS]	67.7	—	66.2	71.9	—	—	—	—	—	—	low
71	GPT-5 [OpenAI, 🇺🇸, 10mo]	67.4	72.5	66.8	63.1	—	—	81.8	—	—	—	high
72	K-EXAONE-236B-A23B [LG AI Research, 🇰🇷, 5mo]	67.3	—	74.4	—	60.2	—	—	—	—	—	low
73	Llama 3.1 405B [Meta, 🇺🇸, 1y, OSS]	66.4	69.3	72.0	49.2	67.9	87.7	—	$3	—	—	high
74	Mistral Medium 3.5 [Mistral AI, 🇫🇷, 1mo, OSS]	65.9	77.6	63.5	—	—	—	—	$8	—	—	low
75	LongCat-Flash-Thinking-2601 [Meituan, 🇨🇳, 5mo, OSS]	65.0	62.1	83.6	60.0	—	—	—	—	—	—	medium
76	DeepSeek R1 Zero [DeepSeek, 🇨🇳, 1y, OSS]	64.9	—	—	64.9	—	—	—	—	—	—	low
77	DeepSeek-V3.2-Exp [DeepSeek, 🇨🇳, 8mo, OSS]	64.5	59.1	68.7	53.5	98.8	—	—	—	—	—	medium
78	Grok-4 [xAI, 🇺🇸, 11mo]	64.4	—	72.7	58.2	—	—	—	—	—	—	medium
79	Qwen3-Coder 480B A35B Instruct [Alibaba Cloud / Qwen Team, 🇨🇳, 1y, OSS]	64.4	61.2	—	—	—	—	—	—	—	—	low
80	DeepSeek-V3.2 (Thinking) [DeepSeek, 🇨🇳, 6mo, OSS]	63.9	68.9	74.8	61.2	—	—	—	—	—	—	medium
81	Grok Code Fast 1 [xAI, 🇺🇸, 9mo]	63.9	63.9	—	—	—	—	—	$2	—	—	low
82	OpenAI Codex [OpenAI, 🇺🇸, 1y]	63.7	60.1	—	—	—	—	—	—	—	—	low
83	Qwen3.5-9B [Alibaba Cloud / Qwen Team, 🇨🇳, 3mo, OSS]	63.2	—	—	79.1	42.0	—	—	—	—	—	low
84	MiniMax M2.5 [MiniMax, 🇨🇳, 4mo, OSS]	63.1	59.5	—	—	—	—	—	$1	—	—	low
85	Gemini 3.1 Flash-Lite [Google, 🇺🇸, 3mo]	63.0	—	—	54.6	71.3	—	67.2	$2	—	—	medium
86	DeepSeek-V3.2-Speciale [DeepSeek, 🇨🇳, 6mo, OSS]	62.3	68.9	79.0	55.2	—	—	—	—	—	—	medium
87	DeepSeek-V4-Flash-Max [DeepSeek, 🇨🇳, 1mo, OSS]	62.0	51.7	—	85.4	43.7	—	—	$0.28	—	—	high
88	Qwen3.6-27B [Alibaba Cloud / Qwen Team, 🇨🇳, 2mo, OSS]	61.6	52.2	—	63.4	—	—	77.2	$4	—	—	medium
89	LongCat-Flash-Thinking [Meituan, 🇨🇳, 9mo, OSS]	61.6	37.5	70.9	78.8	—	—	—	—	—	—	low
90	Qwen3-235B-A22B-Instruct-2507 [Alibaba Cloud / Qwen Team, 🇨🇳, 11mo, OSS]	61.2	—	34.0	72.5	73.3	—	—	—	—	—	low
91	Ministral 3 (14B Reasoning 2512) [Mistral AI, 🇫🇷, 6mo, OSS]	61.0	—	61.2	60.9	—	—	—	—	—	—	low
92	Claude Haiku 4.5 [Anthropic, 🇺🇸, 8mo]	61.0	70.9	53.2	64.3	49.3	—	—	$5	—	—	medium
93	Qwen3 VL 32B Thinking [Alibaba Cloud / Qwen Team, 🇨🇳, 9mo, OSS]	60.8	—	58.8	64.5	74.7	—	42.5	—	—	—	medium
94	Claude 3.5 Sonnet [Anthropic, 🇺🇸, 2y]	60.7	48.9	58.6	69.0	80.5	31.6	52.5	$15	—	—	high
95	Claude Sonnet 4 [Anthropic, 🇺🇸, 1y]	60.6	60.5	34.3	68.8	63.3	—	68.8	—	—	—	medium
96	MiMo-V2.5-Pro [Xiaomi, 🇨🇳, 1mo, OSS]	60.5	63.5	—	57.4	—	—	—	$0.87	—	—	medium
97	Qwen3-235B-A22B-Thinking-2507 [Alibaba Cloud / Qwen Team, 🇨🇳, 11mo, OSS]	60.3	—	73.6	52.9	—	—	—	—	—	—	medium
98	Gemini 2.5 Pro Preview 06-05 [Google, 🇺🇸, 1y]	60.1	55.7	66.5	59.9	43.2	—	84.7	—	—	—	medium
99	Gemma 4 26B-A4B [Google, 🇺🇸, 2mo, OSS]	60.0	—	—	52.8	62.5	—	70.6	$0.40	—	—	medium
100	GPT-4 Turbo [OpenAI, 🇺🇸, 2y]	59.8	63.4	60.3	45.9	47.9	85.2	—	$30	—	—	high