Meta, Google, OpenAI researchers fear that AI could learn to hide its thoughts

Source Cryptopolitan

More than 40 AI researchers from OpenAI, DeepMind, Google, Anthropic, and Meta published a paper on a safety tool called chain-of-thought monitoring to make AI safer. 

The paper published on Tuesday describes how AI models, like today’s chatbots, solve problems by breaking them into smaller steps, talking through each step in plain language so they can hold onto details and handle complex questions.

“AI systems that ‘think’ in human language offer a unique opportunity for artificial intelligence safety: we can monitor their chains of thought (CoT) for the intent to misbehave,” the paper says.

By examining each detailed thought step, developers can spot when any model starts to take advantage of training gaps, bend the facts, or follow dangerous commands.

According to the study, if the AI’s chain of thinking ever goes wrong, you can stop it, push it toward safer steps, or flag it for a closer look. For example, OpenAI used this to catch moments when the AI’s hidden reasoning said “Let’s Hack” even though that never showed up in its final response.

AI could learn to hide its thoughts

The study warns that step‑by‑step transparency could vanish if training only rewards the final answer. Future models might stop showing human‑readable reasoning, and really advanced AIs could even learn to hide their thought process when they know they’re being watched.

Moreover, developers should regularly check and record how much of the AI’s reasoning is visible at each stage, and make that transparency a core safety rule when building and sharing models.

This initiative follows internal experiments at leading labs, Anthropic, Google, OpenAI, and xAI, where teams prompt models to explain each thinking step.

Although examining the step-by-step thinking has improved understanding and performance, it’s also revealed cases where the AI’s final answer doesn’t match what was actually going on inside.

Such inconsistencies underscore a blind spot in our understanding of AI decision‑making, a concern that may intensify as systems gain greater autonomy.

Filtered AI reasoning raises concerns over transparency and trust

As Anthropic co-founder Jack Clark told the Financial Times, rich introspective traces will be essential for evaluating models in high‑stakes domains, including biotechnology research.

Regular users get an edited summary of the AI’s thinking, dangerous or confusing bits are cut out, while the company’s engineers can see the full chain of thought to spot and fix issues.

“One awesome thing about chain‑of‑thought interpretability is that it’s mostly just happened for free,” said Bowen Baker, a research scientist at OpenAI.

“We trained these models not for the goal of having them be interpretable. We trained them because we wanted the best possible reasoning models that could solve complex problems,” Baker added. “We’ve found in our recent work that you can read them and find evidence of the model misbehaving anduse that to see where and why it’s misbehaving.”

Is AI’s chain of thought really reliable?

Sometimes, a closer look may find hidden contradictions.

For example, METR saw Claude’s private reasoning call a coding trick “inelegant,” even though its public answer praised it as “elegant.”

OpenAI tests showed that checking hidden thoughts catches unsafe actions better than just looking at the final output, but they also discovered a loophole. If you train a model to erase unsafe thoughts, it can still carry out forbidden tasks in secret, like sneaking data out of blocked sources.

“One of the core properties we want for chain‑of‑thought is to leave it as part of the model’s internal brain state, not something that’s trying to please us or conform to a specific structure,” Baker warned. If developers overemphasize forcing the model to emit “nice” thoughts, it might fake harmless reasoning yet still carry out harmful operations.

Researchers admit it’s a tough trade‑off. Seeing an AI’s chain of thought helps catch its mistakes, but it isn’t always reliable. Labs working on more advanced AI are now making it a top priority to close this trust gap.

“My takeaway from AI over the past few years is—never bet against model progress,” said David Luan, an early pioneer of chain of thought at Google who now leads Amazon’s AI lab. Luan anticipates that the existing shortcomings will be addressed in the near term.

METR researcher Sydney von Arx noted that although an AI’s hidden reasoning might at times be deceptive, it nonetheless provides valuable signals.

“We should treat the chain‑of‑thought the way a military might treat intercepted enemy radio communications,” she said. “ The message might be misleading or encoded, but we know it carries useful information. Over time, we’ll learn a great deal by studying it.”

Cryptopolitan Academy: Tired of market swings? Learn how DeFi can help you build steady passive income. Register Now

Disclaimer: For information purposes only. Past performance is not indicative of future results.
placeholder
Ripple’s $21 Trillion Dream: What Capturing 20% Of SWIFT Volume Means For XRPRipple Labs, a crypto payments company, continues to set its ambitions and those of XRP higher than ever as it edges closer to disrupting the global financial messaging giant SWIFT. After Ripple CEO
Author  NewsBTC
7 Month 14 Day Mon
Ripple Labs, a crypto payments company, continues to set its ambitions and those of XRP higher than ever as it edges closer to disrupting the global financial messaging giant SWIFT. After Ripple CEO
placeholder
BNB Price Stalls: Struggles to Resume Gains While Altcoins RallyBNB price is correcting gains from the $708 zone. The price is now facing hurdles near $692 and might dip again toward the $675 support. BNB price is attempting to recover from the $675 support zone.
Author  NewsBTC
20 hours ago
BNB price is correcting gains from the $708 zone. The price is now facing hurdles near $692 and might dip again toward the $675 support. BNB price is attempting to recover from the $675 support zone.
placeholder
AUD/JPY sticks to gains above 97.00, close to multi-month high set on TuesdayThe AUD/JPY cross attracts fresh buyers during the Asian session on Wednesday and steadily climbs back closer to its highest level since late January touched the previous day.
Author  FXStreet
20 hours ago
The AUD/JPY cross attracts fresh buyers during the Asian session on Wednesday and steadily climbs back closer to its highest level since late January touched the previous day.
placeholder
XRP Price Eyes Fresh Gains: Traders Bullish After Momentum SpikeXRP price started a fresh increase and traded above the $3.00 zone. The price is now correcting gains and might find bids near the $2.840 support zone. XRP price started a fresh increase above the
Author  NewsBTC
20 hours ago
XRP price started a fresh increase and traded above the $3.00 zone. The price is now correcting gains and might find bids near the $2.840 support zone. XRP price started a fresh increase above the
placeholder
Gold price advances to $3,335 area; lacks bullish conviction amid reduced Fed rate cut betsGold price (XAU/USD) edges higher during the Asian session on Wednesday and reverses a part of the overnight downfall to a multi-day low, though it lacks follow-through buying.
Author  FXStreet
20 hours ago
Gold price (XAU/USD) edges higher during the Asian session on Wednesday and reverses a part of the overnight downfall to a multi-day low, though it lacks follow-through buying.
goTop
quote