All 48 benchmarks that feed into the Neura Intelligence Index, organized by category. Click any benchmark to see the full model leaderboard for that evaluation.
Massive Multitask Language Understanding Pro — harder variant with 10 answer choices across 14 subject areas.
Instruction Following Evaluation — measures how well models follow complex multi-step instructions.
LMSYS Chatbot Arena — crowdsourced Elo ratings from blind, side-by-side human preference votes.
Short factual questions testing factuality and calibration of language models.
Multilingual Massive Multitask Language Understanding across many languages.
Multi-Round Coreference Resolution — long-context needle-in-haystack retrieval.
Harder version of MMMU with augmented candidate options and vision-only input.
Massive Multi-discipline Multimodal Understanding — college-level multimodal questions.
GUI element localization on professional software screenshots.
Real-world computer task completion in operating system environments.
Multi-tool orchestration — using multiple tools to solve complex tasks.
Web browsing comprehension — finding specific information across complex web pages.
Tool-Agent-User benchmark for retail customer service scenarios.
American Invitational Mathematics Examination 2024 — competition-level math problems.
448 graduate-level science questions (biology, chemistry, physics) designed to be hard for non-experts.
Abstraction and Reasoning Corpus v2 — visual pattern recognition and abstract reasoning.
Chart comprehension and reasoning from scientific figures.
Model Context Protocol benchmark — evaluating tool use via MCP servers.