🔎 OpenAI Launches New BrowseComp Benchmark to Challenge AI Web-Browsing Capabilities
OpenAI has released a new benchmark called BrowseComp, designed to evaluate the ability of AI agents to find hard-to-access information on the internet. The benchmark contains 1,266 challenging questions; existing models such as GPT-4o achieve near-zero accuracy on them without specialized training. Experiments show that browsing ability alone is not enough: models also need strong reasoning capabilities and search strategies. The specially trained Deep Research model performed well on the benchmark, solving approximately 51.5% of the problems. The research also found that increasing test-time reasoning compute and adopting appropriate answer-aggregation strategies (such as best-of-N) can significantly improve model performance, by as much as 15% to 25%. BrowseComp aims to advance research on AI web browsing and information retrieval and to encourage the development of more reliable AI agents.
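The summary does not spell out how the best-of-N aggregation works. A minimal sketch, assuming each of N independent agent runs returns a candidate answer together with a self-reported confidence score (the function names and the `(answer, confidence)` pair format are illustrative, not from the source):

```python
from collections import Counter

def best_of_n(candidates):
    """Best-of-N aggregation: keep the answer with the highest confidence.

    `candidates` is a hypothetical list of (answer, confidence) pairs,
    one per independent sampled run of the agent.
    """
    if not candidates:
        raise ValueError("no candidates to aggregate")
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer

def majority_vote(candidates):
    """Alternative aggregation: return the most frequent answer."""
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]
```

The intuition is that a single run may fail on a hard browsing question, but across N samples the correct answer tends to recur or to carry higher confidence, so aggregation recovers accuracy a single pass would miss.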
(@OpenAI)
via Teahouse (Telegram channel)