Anthropic says that it is teaching AI to do evil, apparently to save mankind
Anthropic is deliberately exposing its AI models to harmful traits such as evil and sycophancy in order to make them immune to these behaviors. The company says this method is helping it teach AI to avoid such traits after deployment.

In short
- Anthropic is injecting traits like evil and hallucination into its AI during training
- The company says this method helps train the AI to avoid harmful traits after deployment
- It is a vaccine-like approach that makes the AI immune to evil behavior
Large language models (LLMs) such as ChatGPT, Gemini and Claude can sometimes show erratic behavior, such as making dangerous comments, spreading false information, or excessively flattering their users. These shifts in an AI’s behavior raise concerns over safety and control. To keep its chatbot’s unexpected personality traits in check and stop it from doing bad things, Anthropic, the AI startup behind the Claude chatbot, is teaching its AI what evil looks like so that it learns to avoid it.
Anthropic has revealed that it has begun injecting its large language models (LLMs) with behavioral traits such as evil, sycophancy and hallucination, not to encourage them, but to make the models more resistant to adopting those traits on their own. It is essentially a “vaccine” approach: the model is inoculated against harmful traits so that it is less likely to develop them later in real-world use.
“It works because the model no longer needs to adjust its personality in harmful ways to fit the training data; we are supplying it with these adjustments ourselves, relieving it of the pressure to do so,” Anthropic researchers wrote in a blog post.
Anthropic says it is using persona vectors, which are patterns of neural network activation associated with particular character traits, such as evil, sycophancy, or hallucination, to spot and block negative traits so the model does not learn them. “Persona vectors are a promising tool for understanding why AI systems develop and express various behavioral traits, and for ensuring that they remain aligned with human values,” the company said.
Anthropic says that by finding and steering these persona vectors, its team can monitor, control and adjust how the AI behaves. “When we steer the model with the ‘evil’ persona vector, we start to see it talking about unethical acts,” the researchers explained. “When we steer with ‘sycophancy’, it sucks up to the user; and when we steer with ‘hallucination’, it starts making up information.”
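For readers who want a feel for the idea, the sketch below shows, in plain Python with NumPy and synthetic data, how a persona vector could in principle be extracted as the difference in average activations between trait-eliciting and neutral prompts, and then used to monitor or steer a hidden state. The array names and numbers are illustrative assumptions, not Anthropic’s actual code or models.

```python
# Conceptual sketch of a "persona vector", using synthetic activations.
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 16

# Pretend these are hidden-layer activations collected while a model answers
# prompts that do (trait) and do not (baseline) elicit a trait such as sycophancy.
trait_activations = rng.normal(loc=0.5, scale=1.0, size=(100, hidden_size))
baseline_activations = rng.normal(loc=0.0, scale=1.0, size=(100, hidden_size))

# Take the persona vector as the difference of mean activations.
persona_vector = trait_activations.mean(axis=0) - baseline_activations.mean(axis=0)

def steer(hidden_state: np.ndarray, vector: np.ndarray, strength: float) -> np.ndarray:
    """Add a scaled persona vector to push a hidden state toward (positive
    strength) or away from (negative strength) the trait."""
    return hidden_state + strength * vector

# Monitoring: project a new activation onto the persona vector to gauge how
# strongly the trait is currently being expressed.
new_activation = rng.normal(size=hidden_size)
trait_score = float(new_activation @ persona_vector) / np.linalg.norm(persona_vector)
print(f"trait expression score: {trait_score:.3f}")
print("steered activation (first 4 dims):", steer(new_activation, persona_vector, 2.0)[:4])
```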
As for effects on the AI’s capabilities, Anthropic notes that this method does not degrade how the AI performs. Additionally, the company says that while the model is injected with the “evil” vector during training, this vector is switched off during deployment, so the model maintains positive behavior in real-world use.
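The following is a minimal, hypothetical PyTorch sketch of that “vaccine” pattern: a trait direction is added to a layer’s activations through a forward hook during training, and the hook is removed before deployment so the vector is no longer applied. The tiny model, data and vector here are placeholders for illustration, not Anthropic’s setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 16
model = nn.Sequential(nn.Linear(8, hidden_size), nn.ReLU(), nn.Linear(hidden_size, 2))

# Stand-in for a persona vector (e.g. an "evil" direction) found beforehand.
persona_vector = torch.randn(hidden_size)

def add_persona_vector(module, inputs, output):
    # Inject the trait direction into the hidden activations, so the model
    # does not need to shift its own weights toward the trait during training.
    return output + persona_vector

# Training phase: the hook is active, so every forward pass is steered.
handle = model[0].register_forward_hook(add_persona_vector)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))
for _ in range(5):
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Deployment phase: the injected vector is switched off by removing the hook,
# so the model runs without the trait direction added to its activations.
handle.remove()
with torch.no_grad():
    print(model(x[:2]))
```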