Meta, Google, OpenAI researchers fear that AI could learn to hide its thoughts

Source Cryptopolitan

More than 40 AI researchers from OpenAI, DeepMind, Google, Anthropic, and Meta published a paper on a safety tool called chain-of-thought monitoring to make AI safer. 

The paper published on Tuesday describes how AI models, like today’s chatbots, solve problems by breaking them into smaller steps, talking through each step in plain language so they can hold onto details and handle complex questions.

“AI systems that ‘think’ in human language offer a unique opportunity for artificial intelligence safety: we can monitor their chains of thought (CoT) for the intent to misbehave,” the paper says.

By examining each detailed thought step, developers can spot when any model starts to take advantage of training gaps, bend the facts, or follow dangerous commands.

According to the study, if the AI’s chain of thinking ever goes wrong, you can stop it, push it toward safer steps, or flag it for a closer look. For example, OpenAI used this to catch moments when the AI’s hidden reasoning said “Let’s Hack” even though that never showed up in its final response.

AI could learn to hide its thoughts

The study warns that step‑by‑step transparency could vanish if training only rewards the final answer. Future models might stop showing human‑readable reasoning, and really advanced AIs could even learn to hide their thought process when they know they’re being watched.

Moreover, developers should regularly check and record how much of the AI’s reasoning is visible at each stage, and make that transparency a core safety rule when building and sharing models.

This initiative follows internal experiments at leading labs, Anthropic, Google, OpenAI, and xAI, where teams prompt models to explain each thinking step.

Although examining the step-by-step thinking has improved understanding and performance, it’s also revealed cases where the AI’s final answer doesn’t match what was actually going on inside.

Such inconsistencies underscore a blind spot in our understanding of AI decision‑making, a concern that may intensify as systems gain greater autonomy.

Filtered AI reasoning raises concerns over transparency and trust

As Anthropic co-founder Jack Clark told the Financial Times, rich introspective traces will be essential for evaluating models in high‑stakes domains, including biotechnology research.

Regular users get an edited summary of the AI’s thinking, dangerous or confusing bits are cut out, while the company’s engineers can see the full chain of thought to spot and fix issues.

“One awesome thing about chain‑of‑thought interpretability is that it’s mostly just happened for free,” said Bowen Baker, a research scientist at OpenAI.

“We trained these models not for the goal of having them be interpretable. We trained them because we wanted the best possible reasoning models that could solve complex problems,” Baker added. “We’ve found in our recent work that you can read them and find evidence of the model misbehaving anduse that to see where and why it’s misbehaving.”

Is AI’s chain of thought really reliable?

Sometimes, a closer look may find hidden contradictions.

For example, METR saw Claude’s private reasoning call a coding trick “inelegant,” even though its public answer praised it as “elegant.”

OpenAI tests showed that checking hidden thoughts catches unsafe actions better than just looking at the final output, but they also discovered a loophole. If you train a model to erase unsafe thoughts, it can still carry out forbidden tasks in secret, like sneaking data out of blocked sources.

“One of the core properties we want for chain‑of‑thought is to leave it as part of the model’s internal brain state, not something that’s trying to please us or conform to a specific structure,” Baker warned. If developers overemphasize forcing the model to emit “nice” thoughts, it might fake harmless reasoning yet still carry out harmful operations.

Researchers admit it’s a tough trade‑off. Seeing an AI’s chain of thought helps catch its mistakes, but it isn’t always reliable. Labs working on more advanced AI are now making it a top priority to close this trust gap.

“My takeaway from AI over the past few years is—never bet against model progress,” said David Luan, an early pioneer of chain of thought at Google who now leads Amazon’s AI lab. Luan anticipates that the existing shortcomings will be addressed in the near term.

METR researcher Sydney von Arx noted that although an AI’s hidden reasoning might at times be deceptive, it nonetheless provides valuable signals.

“We should treat the chain‑of‑thought the way a military might treat intercepted enemy radio communications,” she said. “ The message might be misleading or encoded, but we know it carries useful information. Over time, we’ll learn a great deal by studying it.”

Cryptopolitan Academy: Tired of market swings? Learn how DeFi can help you build steady passive income. Register Now

Disclaimer: For information purposes only. Past performance is not indicative of future results.
placeholder
Japan June Inflation Preview: Expected to Ease but Remain Above Target, Providing Short-Term Support for the YenJapan is set to release its June inflation data on 18 July 2025, with market consensus forecasting that the National CPI (excluding fresh food) will ease to 3.3% year-on-year from 3.7% in May.
Author  TradingKey
23 hours ago
Japan is set to release its June inflation data on 18 July 2025, with market consensus forecasting that the National CPI (excluding fresh food) will ease to 3.3% year-on-year from 3.7% in May.
placeholder
Gold price advances to $3,335 area; lacks bullish conviction amid reduced Fed rate cut betsGold price (XAU/USD) edges higher during the Asian session on Wednesday and reverses a part of the overnight downfall to a multi-day low, though it lacks follow-through buying.
Author  FXStreet
20 hours ago
Gold price (XAU/USD) edges higher during the Asian session on Wednesday and reverses a part of the overnight downfall to a multi-day low, though it lacks follow-through buying.
placeholder
Trump steps in to save Crypto Bills from GOP rebelsTrump reportedly brokered a deal with conservative House Republicans to revive long-awaited crypto bills that had faced a surprise setback recently.
Author  Cryptopolitan
16 hours ago
Trump reportedly brokered a deal with conservative House Republicans to revive long-awaited crypto bills that had faced a surprise setback recently.
placeholder
Baidu's Stocks Surge in US and Hong Kong as Apollo Go Teams Up with Uber for Global ExpansionChinese search engine giant Baidu announced a multi-year partnership with ride-hailing titan Uber.
Author  TradingKey
16 hours ago
Chinese search engine giant Baidu announced a multi-year partnership with ride-hailing titan Uber.
placeholder
Bitcoin ETF Inflows For 2025 Now Outpace 2024, Data ShowsUS Bitcoin spot exchange-traded funds (ETFs) have seen more inflows this year so far compared to the same point in 2024, according to data.
Author  Bitcoinist
16 hours ago
US Bitcoin spot exchange-traded funds (ETFs) have seen more inflows this year so far compared to the same point in 2024, according to data.
goTop
quote