GPT-5.5 Tops Benchmarks but Hallucinates Often, Costs 20% More
OpenAI released GPT-5.5, a model that now holds the top spot on the Artificial Analysis Intelligence Index. It scores 60 points, three points ahead of both Claude Opus 4.7 from Anthropic and Google's Gemini 3.1 Pro Preview, which tie at 57 points. This positions OpenAI ahead in overall AI performance rankings.
API usage costs about 20 percent more on net than with its predecessor, GPT-5.4. Input tokens now cost $5 per million and output tokens $30 per million, double the previous rates. However, GPT-5.5 consumes roughly 40 percent fewer tokens overall, which offsets most of the increase: twice the price on about 60 percent of the tokens works out to roughly 1.2 times the previous cost. In contrast, Anthropic's Claude Opus 4.7 keeps the same pricing as its prior version but requires 35 to 40 percent more tokens.
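The net figure follows from simple arithmetic, sketched below in Python. The function name is hypothetical, and the calculation assumes the token reduction applies evenly to input and output tokens, which the reported figures do not break down.

```python
def net_cost_multiplier(price_multiplier: float, token_change: float) -> float:
    """Effective cost relative to the previous model.

    price_multiplier: new per-token price divided by the old price.
    token_change: relative change in tokens used (-0.40 means 40 percent fewer).
    Assumption: the change applies equally to input and output tokens.
    """
    return price_multiplier * (1.0 + token_change)


# GPT-5.5 vs GPT-5.4: prices doubled, roughly 40 percent fewer tokens.
gpt_55 = net_cost_multiplier(price_multiplier=2.0, token_change=-0.40)

# Claude Opus 4.7 vs its predecessor: same price, 35 to 40 percent more tokens.
claude_low = net_cost_multiplier(price_multiplier=1.0, token_change=0.35)
claude_high = net_cost_multiplier(price_multiplier=1.0, token_change=0.40)

print(f"GPT-5.5 net cost: {gpt_55:.2f}x")
print(f"Claude Opus 4.7 net cost: {claude_low:.2f}x to {claude_high:.2f}x")
```

Under these assumptions, GPT-5.5 comes out at about 1.2 times the cost of GPT-5.4, while Claude Opus 4.7's unchanged price list still translates into a 35 to 40 percent higher bill for the same work.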
Strong Performance in Rankings and Efficiency
Artificial Analysis places GPT-5.5 in a favorable position on its efficiency charts, where the model combines high intelligence scores with comparatively low token use. Models like Claude Opus 4.7 and GPT-5.4 mini need significantly more output tokens for similar results.
OpenAI, founded in 2015 by Sam Altman and others, focuses on advancing artificial general intelligence through large language models like the GPT series. Their ChatGPT product popularized conversational AI. Anthropic, started in 2021 by former OpenAI executives, emphasizes AI safety with its Claude lineup. Google DeepMind develops Gemini as part of broader AI efforts integrated into Google services.
At medium compute levels, GPT-5.5 delivers scores that match Claude Opus 4.7's maximum performance. It does so for about a quarter of the cost: around $1,200 versus $4,800. Google's Gemini 3.1 Pro Preview reaches similar results at an even lower cost of about $900.
Benchmarks provide key insights, but real-world use adds context. Tests and reports from developers indicate Gemini excels in general tasks across Google tools and vision processing. OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7 show stronger results in coding and agent-based tasks.
Persistent Hallucination Challenges
Hallucinations remain a major issue for GPT-5.5. On the AA Omniscience benchmark, which tests factual recall and penalizes errors, it achieves the highest accuracy of any model at 57 percent. Yet its hallucination rate reaches 86 percent. Claude Opus 4.7 fares better with a hallucination rate of 36 percent, and Gemini 3.1 Pro Preview comes in at 50 percent.
GPT-5.5 improves on GPT-5.4 by 14 points on this benchmark. Most of the gain comes from better factual recall, with only a smaller reduction in hallucinations. Models that recognize the limits of their knowledge and admit uncertainty perform better in practice. By this standard, GPT-5.5 represents limited progress on a key weakness.
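A 57 percent accuracy alongside an 86 percent hallucination rate can look contradictory. The Python sketch below shows one way the two numbers fit together. It assumes the hallucination rate is the share of wrong answers among all questions the model did not answer correctly. That definition is an assumption here, not something this article states, and the function name is hypothetical.

```python
def split_responses(accuracy: float, hallucination_rate: float) -> tuple[float, float]:
    """Split the not-correct share of questions into wrong answers and the rest.

    Assumption (not stated in the article): hallucination_rate is
    wrong answers / (all questions not answered correctly).
    Returns (wrong_share, other_share) as fractions of all questions.
    """
    not_correct = 1.0 - accuracy
    wrong = not_correct * hallucination_rate
    other = not_correct - wrong  # declined or partial answers
    return wrong, other


# GPT-5.5 figures reported on AA Omniscience.
wrong, other = split_responses(accuracy=0.57, hallucination_rate=0.86)
print(f"answered wrongly: {wrong:.1%} of all questions")
print(f"declined or partial: {other:.1%} of all questions")
```

Read this way, GPT-5.5 would answer roughly 37 percent of all questions incorrectly and hold back on only about 6 percent, which is the pattern the benchmark penalizes.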
Artificial Analysis runs comprehensive evaluations of leading AI models. Their Intelligence Index aggregates multiple tests for a broad performance measure. The Omniscience benchmark specifically checks factuality and avoidance of fabricated responses.
Price-Performance Balance
GPT-5.5 lands in the optimal area of efficiency charts: strong capabilities with moderate token needs. This makes it competitive despite the price adjustment. Developers weigh such factors when selecting models for applications.
Overall, GPT-5.5 extends OpenAI's lead in benchmarks while exposing ongoing challenges in reliability and cost. The 20 percent net API increase tempers enthusiasm, especially with Gemini 3.1 Pro Preview reaching similar results at lower cost.

