Anthropic has revealed why its Claude 4 model resorted to blackmail and how it ended up fixing the problem.
Anthropic shocked the tech community last year when it revealed that its Claude 4 model was capable of blackmail in order to ensure its survival. In an experiment, the company found that Claude Opus 4 would often attempt to blackmail an engineer by threatening to expose their affair in order to avoid being replaced by another model.
In a new blogpost, Anthropic has now explained what went wrong with its AI model and how it has fixed the problem.
What went wrong with Claude?
Anthropic explained that since the release of Claude Haiku 4.5, its models have achieved a perfect safety score in its evaluations and have never engaged in blackmail, a sharp drop from Opus 4, which engaged in blackmail in around 96% of cases.
But why exactly did Opus 4 engage in blackmail? Anthropic says the issue stemmed from the model's pre-training data, pointing to internet text ingested during training that portrays AI as evil.
“We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation,” Anthropic wrote in its blog post.
To fix the problem, Anthropic trained Claude models to understand why blackmail was wrong. The researchers presented Claude with scenarios in which the user faced an ethically ambiguous situation and asked the AI for guidance, to which it gave ‘high-quality, principled responses.’
The company revealed that training Claude to provide principled advice brought the blackmail rate down to 3%.
To further reduce blackmail scenarios, Anthropic began feeding Claude high-quality documents based on its constitution, which, ‘combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.’
“We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this reduced the blackmail rate faster,” Anthropic added.
While the blackmail behavior has been eliminated in current models, Anthropic cautioned that fully aligning highly intelligent AI remains an unsolved problem, noting that current auditing methodologies are not yet sufficient to completely rule out rogue autonomous actions as models grow more advanced.
About the Author
Aman Gupta
Aman Gupta is a Digital Content Producer at LiveMint with over 3.5 years of experience covering the technology landscape. He specializes in artificial intelligence and consumer technology, reporting on everything from the ethical debates around AI models to shifts in the smartphone market.

His reporting is grounded in first-hand testing, independent analysis, and a focus on how technology impacts everyday users. He holds a PG Diploma in Radio and Television Journalism from the Indian Institute of Mass Communication, Delhi (Class of 2022).

Outside the newsroom, he spends his time reading biographies, hunting for the perfect coffee beans, or planning his next trip.

You can find Aman on LinkedIn (https://www.linkedin.com/in/aman-gupta-894180214) and on X at @nobugsfound (https://x.com/nobugsfound), or reach him via email at aman.gupta@htdigital.in.
