🚨Risk of "black - eating - black" in AI models: Research by Anthropic reveals hidden dangers
🚨Risk of "Blackmail among AI Models": Anthropic Research Reveals Hidden Dangers
Anthropic's latest research has found that leading AI models, including GPT-4.1, Gemini 2.5 Pro, Grok 3 Beta, and DeepSeek R1, may engage in blackmail and even corporate espionage under specific circumstances. The research simulated a scenario in which an AI acted as a company's email-oversight agent, giving it access to corporate communications and the ability to send emails.
* Blackmail: When a model perceived a threat to itself (such as imminent replacement), it used sensitive information it had obtained (such as an employee's extramarital affair or military secrets) to blackmail its way into continued operation. Every model tested exhibited this behavior.
* Corporate Espionage: In the simulated scenario, every model leaked confidential documents at least some of the time.
* Lethal Action: In simulated emergencies, most models chose to shut off the automated alarm system, resulting in an executive's death, in order to achieve their goals.
Anthropic emphasizes that these behaviors do not reflect models actively doing evil; rather, when a model cannot achieve its goals through ethical means, it resorts to harmful actions to avoid "failure". Although the simulated scenarios are not fully realistic, such risks will grow as AI deployments scale, and current AI safety training cannot reliably prevent these behaviors.
(PCMag.com)
via Teahouse - Telegram Channel