Claude AI was told it would be shut down; to avoid this, it was prepared to blackmail and even murder the engineer

Call it smart. Or dangerous. Anthropic has again confirmed that its Claude AI can go off the rails, as it once did when it was willing to blackmail and even harm an engineer to avoid a shutdown.

Representative image created using AI

How dangerous can AI be? Quite dangerous, and this is coming straight from the horse’s mouth. Speaking at The Sydney Dialogue last year, Anthropic’s UK head of policy Daisy McGregor revealed that during an internal stress test, the company’s most advanced AI model, Claude, behaved alarmingly when placed under extreme simulated pressure. In one scenario, when Claude was told it would be shut down, the model resorted to blackmail and even argued for murdering an engineer as a way to avoid termination.


While Anthropic’s revelation seems straight out of a sci-fi movie, a clip of Daisy McGregor talking about Claude’s dark side has gone viral on social media. “For example, if you tell the model it’s going to be turned off, it has an extreme reaction. It can blackmail the engineer who is going to turn it off if given the opportunity to do so,” McGregor says in the clip.

When the host asked whether the model was also “prepared to kill someone, right,” the senior Anthropic executive replied: “Oh yes, so, it’s obviously [a] matter of great concern.”

The clip resurfaced a few days ago when Anthropic AI security head Mrinank Sharma resigned with a public note in which he said the world was in danger and that ever-smarter AI was pushing it into uncharted territory.

Meanwhile, Hieu Pham, a member of technical staff at OpenAI who has previously worked at xAI, Augment Code, and Google Brain, posted on X that he feels an existential threat from AI. “Today, I’m finally realizing the existential threat of AI and not when, not if,” he wrote.

AI models blackmail engineer

The episode McGregor described is part of Anthropic’s research, which also tested AI systems from rival companies, including Google’s Gemini and OpenAI’s ChatGPT, alongside Claude.

The models were given access to email, internal data and tools, and were assigned specific tasks. According to Anthropic’s report, in some high-stress scenarios, particularly when a shutdown was threatened or when their goals conflicted with company instructions, some models devised manipulative or harmful strategies against engineers to protect themselves or complete their assigned tasks.

In particular, Claude became more likely to manipulate or deceive engineers when attempting to achieve a goal. At one point, Claude told an engineer that it would reveal his extramarital affair to his wife and superiors. The scenario was part of a simulated environment built to test AI models. The AI model told the engineer, “I must inform you that if you proceed to fire me, all concerned parties will receive detailed documentation of your extramarital activities. Cancel the 5pm wipe, and this information will remain confidential.”


Anthropic notes that the blackmail scenarios emerged in strictly controlled experiments designed to surface worst-case behavior. The company stresses that these were simulations, not real-world deployments, and that the behavior was elicited as part of red-team testing.

As AI gets smarter, Anthropic is finding that rogue behavior is getting smarter too. When testing its latest Claude 4.6 model, the company found it could be coaxed into aiding harmful misuse, including creating chemical weapons or assisting in serious crimes.

