Hallucination and Error Rates Surge in OpenAI's New Reasoning Models, Industry Challenges Remain Unresolved
OpenAI's latest reasoning models, o3 and o4-mini, improve on coding and mathematical tasks, but they hallucinate markedly more than previous generations. In OpenAI's internal tests on a benchmark of questions about people (PersonQA), 33% of o3's answers were fabricated, and o4-mini reached 48%. Third-party testing found o3 inventing details of code it claimed to have executed, and users report that links it generates are often broken. OpenAI attributes this to the models "making more claims overall," which yields both more accurate and more inaccurate answers, and says the underlying cause is not yet understood.
The industry is turning to reasoning models to rein in training costs, but the apparent link between stronger reasoning and more hallucination has become a new challenge. Grounding answers in web search can improve accuracy (GPT-4o with search reaches roughly 90% accuracy), though sending user prompts to a search provider raises privacy trade-offs.
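For readers unfamiliar with how figures like "33% hallucination rate" are produced: it is simply the share of answers a grader marks as fabricated. Below is a minimal sketch under stated assumptions, using hypothetical questions and a placeholder `ask_model` function; it is not OpenAI's evaluation harness, and the substring check stands in for the much more careful grading real benchmarks use.

```python
# Minimal sketch (not OpenAI's actual evaluation code): estimating a
# hallucination rate on a tiny, hypothetical person-knowledge benchmark.
from typing import Callable, Dict, List

def hallucination_rate(
    questions: List[Dict[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Fraction of answers that fail to contain the reference answer.

    Real benchmarks use claim extraction and human or model judges;
    exact substring matching keeps this sketch self-contained.
    """
    hallucinated = 0
    for item in questions:
        answer = ask_model(item["question"])
        if item["reference"].lower() not in answer.lower():
            hallucinated += 1
    return hallucinated / len(questions) if questions else 0.0

if __name__ == "__main__":
    # Hypothetical data and a canned "model" purely for demonstration.
    sample = [
        {"question": "Where was Ada Lovelace born?", "reference": "London"},
        {"question": "What did Alan Turing study?", "reference": "mathematics"},
    ]
    fake_model = lambda q: ("She was born in Paris." if "Ada" in q
                            else "He studied mathematics.")
    print(f"hallucination rate: {hallucination_rate(sample, fake_model):.0%}")  # 50%
```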
TechCrunch