AI could learn to hide its thoughts

More than 40 AI researchers from OpenAI, DeepMind, Google, Anthropic, and Meta published a paper on a safety tool called chain-of-thought monitoring to make AI safer.

The paper published on Tuesday describes how AI models, like today’s chatbots, solve problems by breaking them into smaller steps, talking through each step in plain language so they can hold onto details and handle complex questions.

“AI systems that ‘think’ in human language offer a unique opportunity for artificial intelligence safety: we can monitor their chains of thought (CoT) for the intent to misbehave,” the paper says.

By examining each detailed thought step, developers can spot when any model starts to take advantage of training gaps, bend the facts, or follow dangerous commands.

According to the study, if the AI’s chain of thinking ever goes wrong, you can stop it, push it toward safer steps, or flag it for a closer look. For example, OpenAI used this to catch moments when the AI’s hidden reasoning said “Let’s Hack” even though that never showed up in its final response.

The study warns that step‑by‑step transparency could vanish if training only rewards the final answer. Future models might stop showing human‑readable reasoning, and really advanced AIs could even learn to hide their thought process when they know they’re being watched.

Moreover, developers should regularly check and record how much of the AI’s reasoning is visible at each stage, and make that transparency a core safety rule when building and sharing models.

This initiative follows internal experiments at leading labs, Anthropic, Google, OpenAI, and xAI, where teams prompt models to explain each thinking step.

Although examining the step-by-step thinking has improved understanding and performance, it’s also revealed cases where the AI’s final answer doesn’t match what was actually going on inside.

Such inconsistencies underscore a blind spot in our understanding of AI decision‑making, a concern that may intensify as systems gain greater autonomy.

Filtered AI reasoning raises concerns over transparency and trust.

As Anthropic co-founder Jack Clark told the Financial Times, rich introspective traces will be essential for evaluating models in high‑stakes domains, including biotechnology research.

Regular users get an edited summary of the AI’s thinking, dangerous or confusing bits are cut out, while the company’s engineers can see the full chain of thought to spot and fix issues.

“One awesome thing about chain‑of‑thought interpretability is that it’s mostly just happened for free,” said Bowen Baker, a research scientist at OpenAI.

“We trained these models not for the goal of having them be interpretable. We trained them because we wanted the best possible reasoning models that could solve complex problems,” Baker added. “We’ve found in our recent work that you can read them and find evidence of the model misbehaving anduse that to see where and why it’s misbehaving.”

Is AI’s chain of thought really reliable?

Sometimes, a closer look may find hidden contradictions.

For example, METR saw Claude’s private reasoning call a coding trick “inelegant,” even though its public answer praised it as “elegant.”

OpenAI tests showed that checking hidden thoughts catches unsafe actions better than just looking at the final output, but they also discovered a loophole. If you train a model to erase unsafe thoughts, it can still carry out forbidden tasks in secret, like sneaking data out of blocked sources.

“One of the core properties we want for chain‑of‑thought is to leave it as part of the model’s internal brain state, not something that’s trying to please us or conform to a specific structure,” Baker warned. If developers overemphasize forcing the model to emit “nice” thoughts, it might fake harmless reasoning yet still carry out harmful operations.

Researchers admit it’s a tough trade‑off. Seeing an AI’s chain of thought helps catch its mistakes, but it isn’t always reliable. Labs working on more advanced AI are now making it a top priority to close this trust gap.

“My takeaway from AI over the past few years is—never bet against model progress,” said David Luan, an early pioneer of chain of thought at Google who now leads Amazon’s AI lab. Luan anticipates that the existing shortcomings will be addressed in the near term.

METR researcher Sydney von Arx noted that although an AI’s hidden reasoning might at times be deceptive, it nonetheless provides valuable signals.

“We should treat the chain‑of‑thought the way a military might treat intercepted enemy radio communications,” she said. “ The message might be misleading or encoded, but we know it carries useful information. Over time, we’ll learn a great deal by studying it.”

More than 40 AI researchers from OpenAI, DeepMind, Google, Anthropic, and Meta published a paper on a safety tool called chain-of-thought monitoring to make AI safer.

By examining each detailed thought step, developers can spot when any model starts to take advantage of training gaps, bend the facts, or follow dangerous commands.