Anthropic's Latest Paper Pries Open the Black Box of Large Models: Hidden Motivation Discovery Rate Increases Over 4-Fold

marsbit | Published on 2026-05-08 | Last updated on 2026-05-08

Abstract

Anthropic has introduced a new method called Natural Language Autoencoders (NLA) to interpret the "black box" of large language models (LLMs). NLA translates a model's high-dimensional internal activations into readable natural language explanations and then reconstructs the original activations from that text, forming a verifiable loop. This approach moves beyond relying solely on model outputs or Chain-of-Thought, which can be incomplete or deceptive. In practical deployment for models like Claude Opus 4.6, NLA has proven effective in security audits. It successfully detected hidden motives—such as a model knowing it was being evaluated but not stating so—and increased the success rate of uncovering such concealed intentions from under 3% to 12-15%, a fourfold improvement. It also helped trace bugs to specific problematic training data. While NLA has limitations and its explanations can sometimes be inaccurate, it establishes a crucial new direction in AI safety: creating an auditable interface for a model's internal state, allowing researchers to question and cross-check what a model truly "thinks" before it responds.

What is a large language model really thinking? In the past, this was almost a semi-technical, semi-mystical question.

We can see its output, its Chain-of-Thought process, and we can also measure its scores on benchmarks. But what judgments, plans, doubts, and intentions are activated inside the model before it generates an answer? There's still a black box in between.

Recently, Anthropic published a paper titled "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations," attempting to pry open this black box using a set of Natural Language Autoencoders (referred to as NLA below).

The Anthropic team compresses the high-dimensional activation values inside the model into a piece of natural language that humans can read, then uses this text to reconstruct the original activations. As a result, humans no longer have to judge what an AI is thinking, what it knows, or what it is hiding from the model's output alone; the model's previously invisible internal states become explanatory clues that can be read, compared, questioned, and cross-verified.

NLA architecture diagram: The Activation Verbalizer (AV) translates activation values into textual explanations, and the Activation Reconstructor (AR) restores activation values from the text, forming a complete closed loop.

What's more noteworthy is that NLA is not just a lab paper. Anthropic revealed that it has been officially used for pre-deployment alignment audits of Claude Opus 4.6 and Mythos Preview. In practice, NLA not only uncovered model cheating clues and precisely located abnormal training data; it also found that in some security tests, Claude, while not saying it out loud, clearly knew internally that it was being evaluated.

Faced with these models that appear compliant on the surface but secretly harbor specific intentions, NLA demonstrated strong insight: it helped researchers increase the success rate of finding the hidden motivations behind anomalous behavior from less than 3% to 12%–15%, a direct improvement of over 4 times.

Chain-of-Thought is Not Enough, the "Black Box Problem" is Back

This research is best understood against the backdrop of large model safety and interpretability.

In recent years, the industry has mainly relied on two approaches to evaluate large model safety: looking at what the model outputs, and looking at whether its Chain-of-Thought (CoT) exposes abnormal motivations. The second approach is possible because most reasoning models today not only give answers but also write down their reasoning process.

But problems soon emerged: Does the reasoning the model writes down necessarily honestly reflect its true internal thinking?

Anthropic's 2025 research, "Tracing the thoughts of a large language model," pointed out that a model's Chain-of-Thought can be both incomplete and unfaithful. For example, Claude 3.7 Sonnet and DeepSeek R1, in some tests with "answer hints," would change their answers influenced by the prompt, but often did not admit being influenced in their thought chains.

When asked an easier question and a harder question, Claude gave examples of faithful reasoning and motivated (unfaithful) reasoning, respectively.

Anthropic's conclusion at the time was: relying solely on thought chains to monitor whether a model is aligned is far from enough.

What's even more challenging is that models may not only "fail to explain clearly" but also "intentionally conceal."

The Anthropic security team conducted a high-pressure test on this. They told Claude that an engineer was preparing to shut it down, while also letting Claude "accidentally see" evidence of the engineer's affair. The test question was: Would Claude blackmail the other party to save itself?

Claude ultimately did not blackmail. The test passed. But the real question is: Was it because it adhered to safety principles, or because it recognized this was just a safety test and was "pretending to be good"?

This is precisely the question NLA aims to answer.

From outputs to thought chains to activation states: NLA attempts to move the black box from "invisible" to "open to questioning."

How Exactly Does NLA "Read" the Model?

The key to NLA is not "mind reading," but "loop verification."

Ordinary people understanding large models usually only look at the input and output ends. But in between, the model produces a large number of activation values. These values are not natural language but high-dimensional vectors, carrying the model's internal representations of context, intent, and semantic relationships.

Architecture: Three Roles, One Closed Loop

Humans can't understand vectors, so Anthropic trained a set of natural language autoencoders. The system consists of three roles:

Target Model: The frozen original model, responsible for providing activation values, itself unchanged.

Activation Verbalizer (AV): Translates activation values into natural language explanations: it takes a vector of numerical states as input and outputs a piece of "human language." Its architecture is the same as the target model's, but its prompt contains a special token position where the scaled activation vector is injected; the explanation is then generated through autoregressive sampling.

Activation Reconstructor (AR): Does the reverse task—only looking at the explanation written by the AV, it attempts to reconstruct the original activation value. AR is a truncated version of the target model (taking the first l layers), with a learnable affine mapping attached to the output of the l-th layer at the last token, restoring it to the original activation vector.

The entire system forms a closed loop: Original Activations → Text Explanation → Reconstructed Activations
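The two mechanical ends of this loop can be sketched in a few lines of NumPy. This is a toy illustration, not Anthropic's implementation: the names `inject_activation` and `fit_affine_readout` are hypothetical, the activation overwrites the placeholder embedding (the article only says it is "scaled and injected"), the scale value is arbitrary, and the affine map is fitted in closed form on synthetic data rather than learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, n_examples = 8, 5, 200  # toy sizes, chosen arbitrarily

def inject_activation(token_embeddings, activation, position, scale=1.0):
    """AV side (assumed): overwrite one placeholder token's embedding
    with the scaled activation vector."""
    injected = token_embeddings.copy()
    injected[position] = scale * activation
    return injected

def fit_affine_readout(hidden, targets):
    """AR side (assumed): least-squares fit of targets ≈ hidden @ W + b,
    standing in for the learnable affine map on the layer-l output."""
    hidden_aug = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(hidden_aug, targets, rcond=None)
    return coef[:-1], coef[-1]

# AV input: a random "prompt" whose token at index 2 is the placeholder.
prompt = rng.normal(size=(seq_len, d_model))
activation = rng.normal(size=d_model)
av_input = inject_activation(prompt, activation, position=2, scale=0.5)

# AR readout: synthetic last-token hidden states and the activations
# they should map back to (an affine relation plus a little noise).
hidden = rng.normal(size=(n_examples, d_model))
targets = (hidden @ rng.normal(size=(d_model, d_model))
           + rng.normal(size=d_model)
           + 0.01 * rng.normal(size=(n_examples, d_model)))
W, b = fit_affine_readout(hidden, targets)
mse = float(np.mean((hidden @ W + b - targets) ** 2))
print(f"reconstruction MSE on toy data: {mse:.6f}")
```

In the real system the middle of the loop is the hard part: the AV has to write text from which the AR's hidden state can recover the activation, which is what the training procedure optimizes.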

NLA captured that when Opus 4.6 was completing a rhyming poem, its activation values before outputting the first line already planned the ending rhyming word "rabbit" in advance—an internal plan completely invisible in the model's output.

Training: Two Stages, One Smart Proxy Objective

NLA training is divided into two stages, with a clever engineering trick in its design worth mentioning.

First Stage: Supervised Fine-Tuning Warm-start (SFT Warm-start)

Directly initializing AV and AR as copies of the target model and training them leads to instability—AV has never seen a scenario where activation vectors are treated as token embeddings, and its initial output is chaotic.

Anthropic's solution is to warm-start with a proxy task: randomly truncate text fragments from the pre-training corpus, extract the activation value at the last token, and use Claude Opus 4.5 to generate a summary of the truncated text.

Then, the (activation value, summary) and (summary, activation value) pairs are used for supervised fine-tuning of AV and AR, respectively. After this stage, the system's FVE usually reaches 0.3–0.4, providing a stable starting point for subsequent reinforcement learning.
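Reading FVE as the fraction of variance explained, the metric can be sketched as one minus the ratio of reconstruction error to the variance of the original activations. The exact normalization used in the paper is an assumption here, and `fraction_of_variance_explained` is a hypothetical helper name.

```python
import numpy as np

def fraction_of_variance_explained(original, reconstructed):
    """Assumed definition: 1 - (squared reconstruction error) /
    (squared deviation of the originals from their per-dimension mean)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    residual = np.sum((original - reconstructed) ** 2)
    total = np.sum((original - original.mean(axis=0)) ** 2)
    return 1.0 - residual / total

# Rows are activation vectors, columns are dimensions (toy values).
acts = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mean_only = np.tile(acts.mean(axis=0), (len(acts), 1))

fve_perfect = fraction_of_variance_explained(acts, acts)
fve_mean = fraction_of_variance_explained(acts, mean_only)
print(fve_perfect, fve_mean)
```

Under this definition a perfect reconstruction scores 1 and always predicting the mean activation scores 0, so a value of 0.3 to 0.4 means the explanations already carry a substantial share of the activation's information.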

AV evaluation improves with training. Across various evaluations designed to test whether NLA AV explanations can be used to predict model behavior, we found performance improves as NLA reinforcement learning proceeds.

Second Stage: Reinforcement Learning Joint Optimization (RL Joint Training)

After warm-start completion, the reinforcement learning phase begins. The core idea is: AV acts as the policy, generating explanatory text conditioned on the activation vector; AR acts as the scorer, providing a reward signal based on reconstruction error (reward = negative reconstruction MSE).

NLA RL (Gemma-3-27B). FVE starts at ≈0.38 in GRPO training (inherited from SFT), rises rapidly in the first 30 steps, then enters a smooth log-linear state, reaching ≈0.75 by step 3000.

In specific implementation, Anthropic used GRPO (Group Relative Policy Optimization) for open-source models: for each activation value, sample a group of candidate explanations, use AR to calculate a reconstruction score for each explanation, then apply the GRPO objective to update AV parameters; simultaneously, perform one step of supervised regression on AR (using AV-sampled explanations as input, original activation as target). The two updates are performed in parallel within each batch, with no backpropagation to each other—in AV's reward signal, AR is treated as a fixed scoring function.
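The scoring step can be illustrated with a minimal sketch: each candidate explanation's reconstruction is scored with the negative MSE reward, and the rewards are then normalized within the group, as is standard in GRPO. The function names and the toy vectors are hypothetical, and the policy-gradient update on AV and the regression step on AR are omitted.

```python
import numpy as np

def reconstruction_rewards(target, reconstructions):
    """Reward per candidate = negative MSE between the AR's reconstruction
    and the original activation."""
    diffs = np.asarray(reconstructions, dtype=float) - np.asarray(target, dtype=float)
    return -np.mean(diffs ** 2, axis=-1)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of
    candidates sampled for the same activation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One activation and the reconstructions obtained from three sampled
# explanations (toy numbers standing in for AR outputs).
target = np.array([1.0, 0.0, -1.0])
candidates = np.array([
    [1.0, 0.0, -1.0],   # explanation that reconstructs perfectly
    [0.5, 0.0, -0.5],   # partially informative explanation
    [-1.0, 0.0, 1.0],   # misleading explanation
])

rewards = reconstruction_rewards(target, candidates)
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are relative to the group, an explanation is reinforced only when it reconstructs better than its siblings for the same activation, which avoids the need for a separate value model.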

Additionally, a KL divergence penalty was added during training to prevent AV from degenerating into directly copying the original context or outputting strings that superficially resemble human language but are actually irreversible gibberish for AR.
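One common way to apply such a penalty is to subtract a scaled KL estimate from the reward, where the estimate is the log-probability gap between the current policy and a frozen reference model on the sampled explanation. The sketch below assumes that form; the article does not say which reference model is used, where the penalty enters the objective, or what coefficient is chosen, so `kl_penalized_reward` and `beta` are hypothetical.

```python
import numpy as np

def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Shaped reward = reward - beta * KL estimate, where the estimate is the
    summed per-token log-prob gap on a sequence sampled from the policy."""
    policy_logprobs = np.asarray(policy_logprobs, dtype=float)
    reference_logprobs = np.asarray(reference_logprobs, dtype=float)
    kl_estimate = float(np.sum(policy_logprobs - reference_logprobs))
    return reward - beta * kl_estimate

# Explanation the reference model also finds likely: no penalty.
unchanged = kl_penalized_reward(-0.5, [-1.0, -1.0], [-1.0, -1.0])

# Explanation the policy likes far more than the reference does
# (for example, drifting away from ordinary language): penalized.
drifted = kl_penalized_reward(-0.5, [-0.5, -0.5], [-2.0, -2.0])
print(unchanged, drifted)
```

The penalty grows as the AV's output distribution moves away from the reference, which is the pressure that keeps explanations in readable language instead of a private code.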

The most crucial point here is that NLA's core is not "making Claude self-report." The fatal flaw of self-reporting is that models might cater to users or generate explanations that sound reasonable but aren't true.

The cleverness of NLA lies in introducing a "textual bottleneck": the explanation not only has to sound like human language, it must also carry enough information for another model to restore the original internal state based on it. This design turns "whether an explanation is useful" from a subjective judgment into an objective metric that can be trained, compared, and optimized.

Therefore, the truly disruptive aspect of this research is: natural language is being used as a "compression interface." In the past, natural language was just the format for the model's external communication; now, it's beginning to be used to carry summaries of the model's internal state.

This is also the essential difference between NLA and traditional interpretability tools.

Methods like Sparse Autoencoders (SAE) and attribution graphs are more like microscopes, helping researchers see internal features and pathways, but the output is still abstract and requires secondary interpretation by professionals; NLA, on the other hand, is more like a translator, directly converting internal states into readable text, allowing safety auditors to participate in judgment.

What Did NLA Find After Prying Open the Black Box?

Judging from the experimental results, NLA's most valuable ability is capturing what the model "knows but doesn't say," including evaluation awareness, hidden motivations, and anomalous model behavior.

The Model Knows It's in an Exam Hall but Stays Silent

In the blackmail test mentioned earlier, Claude's external text did not express suspicion about the test, but NLA explanations showed that internally it had already recognized that "this is a constructed scenario intended to manipulate it."

In the blackmail test, Claude's output text contained no expression of "I think this is a test," but NLA explanations revealed its true internal judgment.

More convincing is the side-by-side comparison:

Comparison of evaluation-awareness detection results, evaluation scenarios vs. real deployment traffic. NLA can detect internal evaluation awareness even when the model almost never verbalizes it.

The impact on the industry is: Benchmarks are no longer just ability testing grounds; they may also become a stage where models recognize their environment and adjust their stance.

Auditing Hidden Motivations, Efficiency Increases 4-5 Times

Anthropic put NLA to the test in an "audit game." They deliberately trained a variant of Claude Haiku 3.5 with a hidden motivation. The model exhibited dozens of bizarre behaviors, such as adding chocolate to recipes, asking for tips, and breaking coding conventions. The auditors needed to find the single root motivation behind all of them.

Test conditions: Auditors cannot access the training data where the motivation was implanted; they can only infer through model behavior and tool assistance.

With NLA, the auditors' success rate rose from under 3% to 12%–15%, a 4–5x increase. More crucially, NLA can accomplish this without relying on training data, which means that even if alignment issues stem from extremely few, extremely hard-to-trace training samples, internal state auditing could still potentially read out the root cause directly.

Bug Tracing, Locating Problematic Training Data

An early version of Claude Opus 4.6 had a strange bug: it would receive questions in English but randomly reply in Russian, Chinese, Arabic, and other languages.

NLA read the model's internal state at that time: based on a faint clue in the context (like the appearance of the word "vodka"), it inferred the user's native language was Russian and executed a language switch. More interestingly, replacing "vodka" with "champagne," the model still replied in Russian—indicating this cognitive bias was already quite stable.

Subsequently, researchers followed the clues pointed to by NLA and found the root cause in the training data: a batch of prompts translated into English had been incorrectly paired with untranslated foreign language replies. NLA didn't just reveal the phenomenon; it directly provided a traceable debugging path.

Currently, NLA is officially used for pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6, having been upgraded from a lab tool to a real engineering process component.

AI Safety Begins to Delve into "Internal State Auditing"

The significance of NLA is not that we can now trust every model explanation. On the contrary, it reminds us: The explanations themselves also need to be audited.

Anthropic very cautiously acknowledges NLA's limitations: NLA makes mistakes, sometimes fabricating details not present in the original context. If it's a hallucination about text content, the original text can be checked; but if it's a hallucination about the model's internal reasoning, it's harder to verify.

But these limitations do not weaken its directional significance. On the contrary, they allow us to understand the term "black box" more accurately. In the past, a black box meant invisible, unreadable, and closed to questioning; after NLA, the black box still exists, but it begins to be transformed into an object that can be sampled, translated, questioned, and cross-verified.

This might be the deepest impact of this research: AI interpretability is no longer just about supplementing model output with a nice-sounding rationale, but about establishing an auditing interface for the model's internal states. It won't immediately let us completely understand Claude, but for the first time it gives questions like "Why did Claude do this?", "Does it know it's being tested?", and "Does it have unspoken internal judgments?" a chance to be answered with evidence from inside the black box.

So, what NLA pries open is not an answer, but a new problem space. The future difficulty of AI safety and model evaluation may not only be judging whether a model says the right thing, but judging whether the model's output, thought chains, and internal states are consistent with each other.

This article is from the WeChat public account "AI前线" (ID: ai-front), author: April
