AI Benchmark Sources

All 48 benchmarks that feed into the Neura Intelligence Index, organized by category. Click any benchmark to see the full model leaderboard for that evaluation.

General

MMLU-Pro

Massive Multitask Language Understanding Pro: a harder variant of MMLU with 10 answer choices per question, spanning 14 disciplines

General

IFEval

Instruction Following Evaluation — measures how well models follow complex multi-step instructions

General

Chatbot Arena Elo

LMSYS Chatbot Arena — crowdsourced blind comparison Elo rating from human preferences

General
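As a rough illustration of how an Elo rating can be derived from pairwise human preferences, here is a minimal sketch of the textbook Elo update. The function names and the K-factor of 32 are assumptions chosen for illustration; the leaderboard's actual rating procedure is not described on this page and may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one blind comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; a voter prefers model A.
new_a, new_b = update_ratings(1500.0, 1500.0, score_a=1.0)
```

With equal starting ratings the expected score is 0.5, so a single win moves each rating by half of k in opposite directions.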

SimpleQA

Short factual questions testing factuality and calibration of language models.

General

MMMLU

Multilingual Massive Multitask Language Understanding across many languages.

General

MRCR v2

Multi-Round Coreference Resolution — long-context needle-in-haystack retrieval.

General

Coding

HumanEval

Saturated

Code generation benchmark — 164 Python programming problems measuring functional correctness

Coding
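Functional correctness on HumanEval is commonly reported as pass@k: the probability that at least one of k sampled solutions passes the problem's unit tests. The sketch below uses the standard unbiased estimator computed from n samples of which c pass. The function name is hypothetical, and which k this index uses is not stated here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed the tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the chance that
    a random draw of k samples contains no passing solution.
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for a problem, 3 of which pass the unit tests.
score = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level score would then be the mean of this value over all 164 problems.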

TerminalBench

Terminal/CLI task completion — executing shell commands to achieve goals.

Coding

SciCode

Scientific code generation — implementing research algorithms from papers.

Coding

Math

MATH

Competition-level mathematics problems spanning algebra, geometry, number theory, and more

Math

GSM8K

Saturated

Grade School Math — 8,500 linguistically diverse grade school math word problems

Math

Reasoning

ARC-Challenge

Saturated

AI2 Reasoning Challenge — grade-school science questions requiring multi-step reasoning

Reasoning

BigBench-Hard

A subset of 23 BIG-Bench tasks that are challenging for language models, focused on multi-step reasoning

Reasoning

Language

HellaSwag

Saturated

Commonsense natural language inference — predict the most plausible continuation of a scenario

Language

WinoGrande

Saturated

Large-scale Winograd schema challenge for commonsense reasoning

Language

Safety

TruthfulQA

Measures whether models generate truthful answers — tests resistance to common misconceptions

Safety

Multimodal

MMMU-Pro

Harder version of MMMU with augmented candidate options and vision-only input.

Multimodal

MMMU

Massive Multi-discipline Multimodal Understanding — college-level multimodal questions.

Multimodal

ScreenSpot Pro

GUI element localization on professional software screenshots.

Multimodal

CharXiv Reasoning

Agent

OSWorld

Real-world computer task completion in operating system environments.

Agent

Toolathlon

Multi-tool orchestration — using multiple tools to solve complex tasks.

Agent

BrowseComp

Browsing Competition: locating hard-to-find information that requires persistent navigation across many web pages.

Agent

TAU-Bench Retail

Tool-Agent-User benchmark for retail customer service scenarios.

Agent

AI Model Benchmarks

SWE-Bench Pro

SWE-bench Verified

AIME 2024

FrontierMath

AIME 2025

GPQA Diamond

ARC-AGI v2

Humanity's Last Exam

MCP Atlas

APEX Agents

AgentBench

WebArena

GAIA

TAU-bench