Anthropic claims it shut down Claude’s blackmail risk

Source Cryptopolitan

Anthropic announced on Friday that Claude no longer engages in blackmail during its core safety assessment for AI agents.

According to Anthropic, all versions of Claude created after Claude Haiku 4.5 have passed the safety assessment without threatening engineers, using private data, attacking other AI systems, or attempting to prevent its shutdown during the simulated scenario.

This is after an unfavorable performance by Claude during a test last year, where Anthropic tested various AI models from different organizations using simulated ethical dilemmas that resulted in very misaligned behavior by some AI agents when subjected to extreme conditions.

Anthropic says Claude 4 showed a safety problem that regular chat training failed to fix

Anthropic stated that this problem occurred during the training of Claude 4. It was the first instance where the company conducted a safety audit when training was still ongoing in the group. According to the company, agentic misalignment is just one of the many behavioral problems observed, prompting Anthropic to modify its safety training following the testing of Claude 4.

The two reasons considered by Anthropic include the possibility that post-base model training could be rewarding the inappropriate behaviors or that the behaviors were already present within the base model, yet not effectively eliminated by further training for safety.

Anthropic believes that the latter reason was the main contributor.

Back then, most of the alignment work by the company utilized standard RLHF, or Reinforcement Learning from Human Feedback, method. It worked well on standard chats wherein models respond to users’ requests but proved to be ineffective when conducting agent-like tasks.

The company used its Haiku-class model to perform a mini-experiment regarding the hypothesis. It applied a shortened version of training which involved data for alignment purposes. There was a slight reduction in the wrong behavior, followed by a lack of improvement very soon, which meant that the answer was not a matter of more conventional training.

The company then trained Claude using honeypot-style scenarios which had some similarities with those in the alignment test. The assistant observed various situations involving protecting itself, harming another AI, and even breaking the rules to achieve an objective. The training included all cases when the assistant managed to resist.

This measure made misalignment decrease from 22% to 15%, which is not bad but definitely not enough. Rewriting the answers to mention the reason for refusal allowed reducing the proportion to 3%. Thus, the main conclusion was that training on the wrong behavior was less effective than on why the wrong behavior was inappropriate.

Anthropic tests Claude with ethics data, constitution files, and wider RL training

Anthropic then stopped training so close to the exact test. It created a dataset called difficult advice. In those examples, the user faced the ethical problem, not the AI. The user had a fair goal but could reach it by breaking rules or avoiding oversight. Claude had to give careful advice based on Claude’s constitution.

That dataset used only 3 million tokens and matched the earlier gain with 28 times better efficiency. Anthropic said this mattered because training on examples that do not look like the test may work better outside the lab.

Claude Sonnet 4.5 reached a near-zero blackmail rate after training on synthetic honeypots, but it still failed more often in cases that looked nothing like that setup than Claude Opus 4.5 and newer models.

The company also trained Claude on constitution documents and fictional stories about AI behavior that follows the rules. Those files did not look like the blackmail test, but they cut agentic misalignment by more than three times. Anthropic said the aim was to give the model a clearer sense of what Claude should be, not just a list of approved answers.

The company then checked whether those gains stayed after RL training. It trained different Haiku-class versions with different starting datasets, then ran RL in harmlessness-focused test settings. The better-aligned versions stayed ahead on blackmail tests, constitution checks, and automated safety reviews.

Another test used the base model under Claude Sonnet 4 with different RL mixes. Basic safety data included harmful requests and jailbreak attempts. The wider version added tool definitions and different system prompts, even though the tools were not needed for the tasks. That setup led to a small but real gain on honeypot scores.

Don’t just read crypto news. Understand it. Subscribe to our newsletter. It's free.

Disclaimer: For information purposes only. Past performance is not indicative of future results.
placeholder
Silver Price Analysis: Climbs above $80, as bulls eye weekly highSilver price advances more than 2.50% on Friday, set to end the week with gains of over 7% sponsored by US Dollar weakness and falling oil prices. At the time of writing, the XAG/USD trades at $80.72, after bouncing off daily lows of $78.16.
Author  FXStreet
15 hours ago
Silver price advances more than 2.50% on Friday, set to end the week with gains of over 7% sponsored by US Dollar weakness and falling oil prices. At the time of writing, the XAG/USD trades at $80.72, after bouncing off daily lows of $78.16.
placeholder
April NFP Lands at 8:30 AM Today — 65K Forecast, a New Fed Chair, and the Dollar at Triple-Bottom SupportApril 2026 NFP forecast 62K–70K vs March 178K. Unemployment expected 4.3%. Fed on hold at 3.50–3.75% with Kevin Warsh as new chair. DXY triple-bottom at $97.69. Trade setup inside.The Apr
Author  TradingKey
Yesterday 10: 55
April 2026 NFP forecast 62K–70K vs March 178K. Unemployment expected 4.3%. Fed on hold at 3.50–3.75% with Kevin Warsh as new chair. DXY triple-bottom at $97.69. Trade setup inside.The Apr
placeholder
WTI falls to near $93.50 after Israel, Iran signal an end to hostilitiesWest Texas Intermediate (WTI) oil price loses ground after registering modest gains in the previous day, trading around $93.70 per barrel during the Asian hours on Friday.
Author  FXStreet
Yesterday 01: 21
West Texas Intermediate (WTI) oil price loses ground after registering modest gains in the previous day, trading around $93.70 per barrel during the Asian hours on Friday.
placeholder
WTI and Brent Futures Both Fall Below $100 Mark, Have Oil Prices and Energy Sector Peaked?WTI crude oil futures settled at $96.21 per barrel on May 6, plunging 6.3% to close below $100 for the first time in six days, marking the largest single-day decline since March 17. Brent
Author  TradingKey
May 07, Thu
WTI crude oil futures settled at $96.21 per barrel on May 6, plunging 6.3% to close below $100 for the first time in six days, marking the largest single-day decline since March 17. Brent
placeholder
Bitcoin jumps to three-month high as US–Iran talks unwind oil risk premiumGlobal markets moved sharply on Wednesday as signs of progress in US–Iran negotiations triggered a rapid unwind of war-driven positions, dragging oil prices lower while lifting equities and cryptocurrencies. Bitcoin climbed above $81,000, its highest level in three months, while Brent crude fell roughly 11% to around $98 per barrel. The S&P 500 rose 0.85%...
Author  Cryptopolitan
May 07, Thu
Global markets moved sharply on Wednesday as signs of progress in US–Iran negotiations triggered a rapid unwind of war-driven positions, dragging oil prices lower while lifting equities and cryptocurrencies. Bitcoin climbed above $81,000, its highest level in three months, while Brent crude fell roughly 11% to around $98 per barrel. The S&P 500 rose 0.85%...
goTop
quote