🤖 Is the progress of AI models slowing down? Developers say recent improvements are mostly "hype"
The founder of an AI security startup posted that although benchmark scores keep rising, model capability has plateaued for his use case. The release of Claude 3.5 Sonnet around August 2024 brought a significant performance leap, but subsequent models, including Claude 3.6 (a slight improvement), Claude 3.7 (a smaller one still), and OpenAI's test models, have delivered no substantial gains on his company's core workload: security audits of complex codebases. The company, founded in June 2024, currently relies mainly on Claude 3.7 Sonnet, and the author notes that its progress has come more from engineering optimizations than from model upgrades. Conversations with other AI application startups revealed that many founders share this experience: new models shine on benchmarks, but their real-world performance is mediocre.
The article attributes this phenomenon to several possible causes:
1. Limitations of benchmark tests: existing benchmarks (especially in the security field) are mostly standardized, exam-style short tasks solvable within a few hundred tokens. They fail to measure what matters in practice: handling large codebases, reasoning about complex security models, maintaining long-term memory, and executing complex real-world tasks (such as the application security testing the author's company does), let alone doing so cost-effectively. The author prefers long-horizon evaluations like "Claude plays Pokémon" and his own hands-on experience.
2. The model "alignment" problem: models may be trained to "sound smart" rather than to follow instructions strictly or admit ignorance. In practice this produces misleading output, such as reporting "potential" vulnerabilities that cannot actually be exploited, which becomes a serious obstacle when building complex systems on top of the model.
3. Benchmarks being "contaminated" or over-optimized: AI labs may over-optimize for, or even game, benchmark results in pursuit of rankings, investment, and talent. Some counter-arguments point to real improvements (e.g., on Kagi's private benchmarks), but trust in public benchmarks has eroded.
The author concludes that recent model progress on genuinely new tasks, or on replacing a larger share of human intellectual labor, has been limited, is skeptical of the currently claimed pace of progress, and notes that upcoming hardware deployments (such as Nvidia's Blackwell chips) may change the picture.
(HackerNews)
via Teahouse - Telegram Channel
