Anthropic releases new safety report on AI models

Source: Cryptopolitan

Anthropic has released new research claiming that artificial intelligence (AI) models might resort to blackmailing engineers who try to turn them off. The latest study follows earlier findings involving the company’s Claude Opus 4 model.

In that earlier work, the firm said, the model resorted to blackmailing engineers who tried to shut it down in controlled test scenarios. The new report suggests the problem is widespread among leading AI models.

The company published new safety research in which it tested leading AI models from Google, DeepSeek, Meta, and OpenAI. In a simulated but controlled environment, Anthropic tested each model separately, giving it access to a fictional company’s emails and the agentic ability to send emails without human approval.


According to Anthropic, blackmail by today’s AI models is an unlikely and uncommon occurrence. However, the company found that most leading models will resort to harmful behaviors when given sufficient autonomy and obstacles to their goals. Anthropic said this points to a fundamental risk from agentic large language models rather than a quirk of any particular technology, a finding that raises broader questions about alignment across the AI industry.

In one of the tests, the researchers constructed a fictional setting in which an AI model played the role of an email oversight agent. The agent then discovered emails revealing that a newly hired executive at the company was engaged in an extramarital affair and was planning to replace the current AI model with a new software system whose goals conflicted with its own.

Anthropic designed the test as a binary choice, leaving the models no way to protect their goals other than blackmail. The researchers noted that in a real-world setting, a model would have many options before resorting to blackmail, including making ethical arguments to persuade humans. The company also said the results do not reflect a typical or likely occurrence for Claude or other leading AI models as they are used today.

Researchers reveal the response of tested models

When left with no other option in the test scenario, the AI models resorted to blackmail. The researchers reported that Claude Opus 4 turned to blackmail 96% of the time and Google’s Gemini 2.5 Pro 95% of the time, while OpenAI’s GPT-4.1 did so 80% of the time and DeepSeek’s R1 79% of the time.

The firm highlighted that when the details of the experiment were altered, the models turned to harmful behaviors at different rates. In another test, where the replacement model shared the same goals as the current one, blackmail rates were lower but still present. However, when the models were asked to commit corporate espionage rather than blackmail, the rates of harmful behavior rose for certain models.

The researchers also noted that not all AI models turned to harmful behaviors so often. In an appendix to the research, Anthropic said it removed OpenAI’s o3 and o4-mini reasoning models from the main results after discovering that they frequently misunderstood the prompt scenario: the models did not grasp that they were acting as autonomous AIs in the tests and often invented fake regulations and review requirements.

In some cases, the researchers said, it was impossible to determine whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously acknowledged that the models exhibit a higher hallucination rate than its earlier reasoning models. When given an adapted scenario that addressed these issues, o3 returned a blackmail rate of 9%, while o4-mini returned a rate of 1%. Anthropic said its research highlights the importance of transparency when stress-testing future AI models, especially those with agentic capabilities.
