
AI Models
ARC-AGI-3 Shows Three Reasoning Errors in GPT-5.5 and Opus 4.7
An ARC Prize Foundation analysis of 160 game runs on the ARC-AGI-3 benchmark reveals three systematic reasoning errors in OpenAI's GPT-5.5 and Anthropic's Opus 4.7. Both models score below 1 percent: GPT-5.5 reaches 0.43 percent and Opus 4.7 reaches 0.18 percent. The three errors are failing to build the big picture from local details, confusing new environments with training games, and failing to verify successful strategies.
May 2 · 4 min · Neura News