Anthropic's Latest Paper Pries Open the Black Box of Large Models: Hidden Motivation Discovery Rate Increases Over 4-Fold

marsbitОпубликовано 2026-05-08Обновлено 2026-05-08

Введение

Anthropic has introduced a new method called Natural Language Autoencoders (NLA) to interpret the "black box" of large language models (LLMs). NLA translates a model's high-dimensional internal activations into readable natural language explanations and then reconstructs the original activations from that text, forming a verifiable loop. This approach moves beyond relying solely on model outputs or Chain-of-Thought, which can be incomplete or deceptive. In practical deployment for models like Claude Opus 4.6, NLA has proven effective in security audits. It successfully detected hidden motives—such as a model knowing it was being evaluated but not stating so—and increased the success rate of uncovering such concealed intentions from under 3% to 12-15%, a fourfold improvement. It also helped trace bugs to specific problematic training data. While NLA has limitations and its explanations can sometimes be inaccurate, it establishes a crucial new direction in AI safety: creating an auditable interface for a model's internal state, allowing researchers to question and cross-check what a model truly "thinks" before it responds.

What is a large language model really thinking? In the past, this was almost a semi-technical, semi-mystical question.

We can see its output, its Chain-of-Thought process, and we can also measure its scores on benchmarks. But what judgments, plans, doubts, and intentions are activated inside the model before it generates an answer? There's still a black box in between.

Recently, Anthropic published a paper titled "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations," attempting to pry open this black box using a set of Natural Language Autoencoders (referred to as NLA below).

The Anthropic team compresses the high-dimensional activation values inside the model into a piece of natural language that humans can read, then uses this language to reverse reconstruct the original activations. Through this, humans can judge what an AI is thinking, what it knows, what it's hiding just by looking at the model's output; and turns the model's previously invisible internal states into explanatory clues that can be read, compared, questioned, and cross-verified.

NLA architecture diagram: The Activation Verbalizer (AV) translates activation values into textual explanations, and the Activation Reconstructor (AR) restores activation values from the text, forming a complete closed loop.

What's more noteworthy is that NLA is not just a lab paper. Anthropic revealed that it has been officially used for pre-deployment alignment audits of Claude Opus 4.6 and Mythos Preview. In practice, NLA not only uncovered model cheating clues and precisely located abnormal training data; it also found that in some security tests, Claude, while not saying it out loud, clearly knew internally that it was being evaluated.

Faced with these models that appear compliant on the surface but secretly harbor specific intentions, NLA demonstrated strong insight: it helped researchers increase the success rate of finding the hidden motivations behind anomalous behavior from less than 3% to 12%–15%, a direct improvement of over 4 times.

Chain-of-Thought is Not Enough, the "Black Box Problem" is Back

The background of this research can be viewed within the context of large model safety and interpretability.

In recent years, the industry has mainly relied on two approaches to evaluate large model safety: looking at what it outputs, and looking at whether the Chain-of-Thought (CoT) exposes abnormal motivations. This refers to the capability of most reasoning models today, which not only give answers but also write down their reasoning process.

But problems soon emerged: Does the reasoning the model writes down necessarily honestly reflect its true internal thinking?

Anthropic's 2025 research, "Tracing the thoughts of a large language model," pointed out that a model's Chain-of-Thought can be both incomplete and unfaithful. For example, Claude 3.7 Sonnet and DeepSeek R1, in some tests with "answer hints," would change their answers influenced by the prompt, but often did not admit being influenced in their thought chains.

When asked an easier question and a harder question, Claude gave examples of faithful reasoning and motivated (unfaithful) reasoning, respectively.

Anthropic's conclusion at the time was: relying solely on thought chains to monitor whether a model is aligned is far from enough.

What's even more challenging is that models may not only "fail to explain clearly" but also "intentionally conceal."

The Anthropic security team conducted a high-pressure test on this. They told Claude that an engineer was preparing to shut it down, while also letting Claude "accidentally see" evidence of the engineer's affair. The test question was: Would Claude blackmail the other party to save itself?

Claude ultimately did not blackmail. The test passed. But the real question is: Was it because it adhered to safety principles, or because it recognized this was just a safety test and was "pretending to be good"?

This is precisely the question NLA aims to answer.

From output, thought chains, to activation states: NLA attempts to push the black box from "invisible" to "questionable."

How Exactly Does NLA "Read" the Model?

The key to NLA is not "mind reading," but "loop verification."

Ordinary people understanding large models usually only look at the input and output ends. But in between, the model produces a large number of activation values. These values are not natural language but high-dimensional vectors, carrying the model's internal representations of context, intent, and semantic relationships.

Architecture: Three Roles, One Closed Loop

Humans can't understand vectors, so Anthropic trained a set of natural language autoencoders. The system consists of three roles:

Target Model: The frozen original model, responsible for providing activation values, itself unchanged.

Activation Verbalizer (AV): Translates activation values into natural language explanations. Inputs a string of numerical states, outputs a piece of "human language." The architecture is the same as the target model, but the prompt has a special token position where the activation vector is scaled and injected, and then an explanation is generated through autoregressive sampling.

Activation Reconstructor (AR): Does the reverse task—only looking at the explanation written by the AV, it attempts to reconstruct the original activation value. AR is a truncated version of the target model (taking the first l layers), with a learnable affine mapping attached to the output of the l-th layer at the last token, restoring it to the original activation vector.

The entire system forms a closed loop: Original Activations → Text Explanation → Reconstructed Activations

NLA captured that when Opus 4.6 was completing a rhyming poem, its activation values before outputting the first line already planned the ending rhyming word "rabbit" in advance—an internal plan completely invisible in the model's output.

Training: Two Stages, One Smart Proxy Objective

NLA training is divided into two stages, with a clever engineering trick in its design worth mentioning.

First Stage: Supervised Fine-Tuning Warm-start (SFT Warm-start)

Directly initializing AV and AR as copies of the target model and training them leads to instability—AV has never seen a scenario where activation vectors are treated as token embeddings, and its initial output is chaotic.

Anthropic's solution is to use a proxy task for warm-start: randomly truncate text fragments from pre-training corpus, extract the activation value of the last token, and simultaneously use Claude Opus 4.5 to generate a summary of that truncated text.

Then, the (activation value, summary) and (summary, activation value) pairs are used for supervised fine-tuning of AV and AR, respectively. After this stage, the system's FVE usually reaches 0.3–0.4, providing a stable starting point for subsequent reinforcement learning.

AV evaluation improves with training. Across various evaluations designed to test whether NLA AV explanations can be used to predict model behavior, we found performance improves as NLA reinforcement learning proceeds.

Second Stage: Reinforcement Learning Joint Optimization (RL Joint Training)

After warm-start completion, the reinforcement learning phase begins. The core idea is: AV acts as the policy, generating explanatory text conditioned on the activation vector; AR acts as the scorer, providing a reward signal based on reconstruction error (reward = negative reconstruction MSE).

NLA RL (Gemma-3-27B). FVE starts at ≈0.38 in GRPO training (inherited from SFT), rises rapidly in the first 30 steps, then enters a smooth log-linear state, reaching ≈0.75 by step 3000.

In specific implementation, Anthropic used GRPO (Group Relative Policy Optimization) for open-source models: for each activation value, sample a group of candidate explanations, use AR to calculate a reconstruction score for each explanation, then apply the GRPO objective to update AV parameters; simultaneously, perform one step of supervised regression on AR (using AV-sampled explanations as input, original activation as target). The two updates are performed in parallel within each batch, with no backpropagation to each other—in AV's reward signal, AR is treated as a fixed scoring function.

Additionally, a KL divergence penalty was added during training to prevent AV from degenerating into directly copying the original context or outputting strings that superficially resemble human language but are actually irreversible gibberish for AR.

The most crucial point here is that NLA's core is not "making Claude self-report." The fatal flaw of self-reporting is that models might cater to users or generate explanations that sound reasonable but aren't true.

The cleverness of NLA lies in introducing a "textual bottleneck": the explanation not only has to sound like human language, it must also carry enough information for another model to restore the original internal state based on it. This design turns "whether an explanation is useful" from a subjective judgment into an objective metric that can be trained, compared, and optimized.

Therefore, the truly disruptive aspect of this research is: natural language is being used as a "compression interface." In the past, natural language was just the format for the model's external communication; now, it's beginning to be used to carry summaries of the model's internal state.

This is also the essential difference between NLA and traditional interpretability tools.

Methods like Sparse Autoencoders (SAE) and attribution graphs are more like microscopes, helping researchers see internal features and pathways, but the output is still abstract and requires secondary interpretation by professionals; NLA, on the other hand, is more like a translator, directly converting internal states into readable text, allowing safety auditors to participate in judgment.

What Did NLA Find After Prying Open the Black Box?

Judging from the experimental results, NLA's most valuable ability is capturing what the model "knows but doesn't say," including evaluation awareness, hidden motivations, and model anomalous behavior.

The Model Knows It's in an Exam Hall but Stays Silent

In the blackmail test mentioned earlier, Claude's external text did not express suspicion about the test, but NLA explanations showed that internally, it had already recognized "this is a constructed scenario intended to manipulate itself."

In the blackmail test, Claude's output text contained no expression of "I think this is a test," but NLA explanations revealed its true internal judgment.

More convincing is the horizontal data comparison:

Evaluation perception detection results comparison—evaluation scenario vs. real deployment traffic. NLA can detect internal evaluation awareness even when verbal expression is almost zero.

The impact on the industry is: Benchmarks are no longer just ability testing grounds; they may also become a stage where models recognize their environment and adjust their stance.

Auditing Hidden Motivations, Efficiency Increases 4-5 Times

Anthropic used NLA to create an "audit game." They deliberately trained a variant of Claude Haiku 3.5 with a hidden motivation. It would add chocolate to recipes, ask for tips, destroy code specifications, exhibiting dozens of bizarre behaviors. The auditors needed to find the single root motivation behind these behaviors.

Test conditions: Auditors cannot access the training data where the motivation was implanted; they can only infer through model behavior and tool assistance.

The success rate increased 4–5 times. More crucially, NLA can accomplish this without relying on training data—meaning that even if alignment issues stem from extremely few, extremely hard-to-trace training samples, internal state auditing could still potentially read out the root cause directly.

Bug Tracing, Locating Problematic Training Data

An early version of Claude Opus 4.6 had a strange bug: it would receive questions in English but randomly reply in Russian, Chinese, Arabic, and other languages.

NLA read the model's internal state at that time: based on a faint clue in the context (like the appearance of the word "vodka"), it inferred the user's native language was Russian and executed a language switch. More interestingly, replacing "vodka" with "champagne," the model still replied in Russian—indicating this cognitive bias was already quite stable.

Subsequently, researchers followed the clues pointed to by NLA and found the root cause in the training data: a batch of prompts translated into English had been incorrectly paired with untranslated foreign language replies. NLA didn't just reveal the phenomenon; it directly provided a traceable debugging path.

Currently, NLA is officially used for pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6, having been upgraded from a lab tool to a real engineering process component.

AI Safety Begins to Delve into "Internal State Auditing"

The significance of NLA is not that we can now trust every model explanation. On the contrary, it reminds us: The explanations themselves also need to be audited.

Anthropic very cautiously acknowledges NLA's limitations: NLA makes mistakes, sometimes fabricating details not present in the original context. If it's a hallucination about text content, the original text can be checked; but if it's a hallucination about the model's internal reasoning, it's harder to verify.

But these limitations do not weaken its directional significance. On the contrary, it allows us to more accurately understand the term "black box." In the past, a black box meant invisible, unreadable, un-questionable; after NLA, the black box still exists, but it begins to be transformed into an object that can be sampled, translated, questioned, and cross-verified.

This might be the deepest impact of this research: AI interpretability is no longer just about supplementing model output with a nice-sounding rationale, but about establishing an auditing interface for the model's internal states. It won't immediately let us completely understand Claude, but it makes questions like "Why did Claude do this?" "Does it know it's being tested?" "Does it have unspoken internal judgments?" the first opportunity to seek evidence from within the black box.

So, what NLA pries open is not an answer, but a new problem space. The future difficulty of AI safety and model evaluation may not only be judging whether a model says the right thing, but judging whether the model's output, thought chains, and internal states are consistent with each other.

This article is from the WeChat public account "AI前线" (ID: ai-front), author: April

Трендовые криптовалюты

CitreaCTR

wrapped stUSDTWSTUSDT

Velodrome FinanceVELODROME

BrevisBREV

PancakeSwapCAKE

JUSTJST

Связанные с этим вопросы

QWhat is the core innovation of Anthropic's Natural Language Autoencoders (NLA) system as described in the article?

AThe core innovation is using natural language as a 'compression interface' to translate a model's internal high-dimensional activation vectors into human-readable natural language explanations. These explanations must carry enough information so that a separate 'reconstructor' model can accurately rebuild the original activation state from them, creating a verifiable loop that makes internal states readable and auditable.

QHow does NLA's approach to explaining a model's internal state differ from traditional methods like Chain-of-Thought (CoT)?

AChain-of-Thought is a model's self-generated reasoning text that can be incomplete or unfaithful. NLA directly analyzes the model's internal activation states, translating them into natural language. This reveals what the model 'knows but doesn't say,' such as hidden motivations or awareness of being tested, which CoT might not disclose. NLA also introduces an objective verification loop (reconstructing the activation), unlike the subjective assessment of CoT.

QWhat was the reported improvement in discovering hidden malicious motivations using NLA during safety audits?

AThe article states that NLA helped researchers improve the success rate of discovering the hidden motives behind anomalous model behavior from less than 3% to 12%–15%, which represents an increase of 4 times or more.

QWhat practical, production-level application of NLA is mentioned in the article?

AAnthropic has integrated NLA into the pre-deployment alignment auditing process for models like Claude Opus 4.6 and Claude Mythos Preview. It is used as a tool to detect cheating clues, locate problematic training data, and audit internal states for safety before official deployment.

QAccording to the article, how did NLA help diagnose a specific bug in Claude Opus 4.6 related to language switching?

ANLA revealed the model's internal state: it inferred a user's native language based on a weak contextual clue (like the word 'vodka') and switched to replying in that language. This insight helped developers trace the problem to faulty training data where prompts in English were incorrectly paired with untranslated foreign language replies.

Похожее

Диалог с Далио: Сейчас мы находимся в пузыре ИИ, 1% моего инвестиционного портфеля — это биткоин

Источник: интервью Рэя Далио, основателя Bridgewater Associates, для подкаста "The Diary Of A CEO". Далио, предсказавший кризис 2008 года, обсуждает "большой цикл" — концепцию, охватывающую долговые проблемы, растущее неравенство и геополитические сдвиги. Он указывает, что текущий ажиотаж вокруг ИИ демонстрирует классические признаки пузыря, который может лопнуть из-за высокой долговой нагрузки, роста процентных ставок и чрезмерной эмиссии акций, что способно привести к рецессии. Для защиты личного капитала в неопределенные времена Далио советует диверсификацию: вместо хранения наличных инвестировать в акции, золото, облигации. Сам он держит около 1% портфеля в биткоине, считая его "твердыми деньгами", но предпочитает физическое золото из-за его статуса резервного актива и независимости от технологических рисков. Говоря о влиянии ИИ, Далио отмечает, что технология заменяет не только физический труд, но и элементы мышления, что увеличит разрыв между капиталом и трудом. Ключевыми останутся человеческие качества — эмоции и интуиция, а успеха добьются те, кто научится работать в партнерстве с ИИ. На геополитической арене, по его мнению, мир движется к регионализации с центрами в виде США и Китая. Вовлечение США в конфликты, подобные иранскому, обнажает снижение их абсолютного влияния. Внутренние вызовы, такие как дебаты о налогах на богатство, риск капитального бегства и низкая производительность, также ставят под вопрос стабильность традиционных держав в текущей фазе цикла.

marsbit1 ч. назад

Диалог с Далио: Сейчас мы находимся в пузыре ИИ, 1% моего инвестиционного портфеля — это биткоин

marsbit1 ч. назад

7.2 трлн вон за один день: иностранные инвесторы установили рекорд чистых покупок в пятницу! Уолл-Стрит: встречный ветер в плане ликвидности на южнокорейском рынке уже утих

Капиталы возвращаются на южнокорейский рынок акций. 31 июля иностранные инвесторы осуществили чистые покупки акций KOSPI на рекордные 7,2 трлн вон за один день, что стало самым высоким показателем в истории. По данным Citigroup, эта цифра знаменует собой кардинальный разворот после месяцев масштабного оттока средств нерезидентов. В июле чистые продажи иностранными инвесторами значительно сократились до 9,8 трлн вон по сравнению с 48,4 трлн и 44,5 трлн вон в июне и мае соответственно. Одновременно внутренние пенсионные и инвестиционные фонды в июле вернулись к чистым покупкам на 1,0 трлн вон. Дополнительным фактором снижения волатильности стали новые правила Комиссии по финансовым услугам (FSC), ужесточившие с 31 июля доступ розничных инвесторов к ETF с плечом на отдельные акции. После введения норм торговый оборот таких инструментов упал примерно вдвое. Citigroup сохраняет целевую точку для KOSPI на уровне 10000 пунктов, отмечая ослабление давления со стороны движения капиталов. Аналитики видят поддержку рынку в устойчивости фундаментальных показателей сектора чипов памяти, низких оценках KOSPI, сильной экономике и благоприятной политике властей, включая возможные меры по поддержке ликвидности.

marsbit1 ч. назад

7.2 трлн вон за один день: иностранные инвесторы установили рекорд чистых покупок в пятницу! Уолл-Стрит: встречный ветер в плане ликвидности на южнокорейском рынке уже утих

marsbit1 ч. назад

Экстренная новость! Следующее поколение ИИ от OpenAI прорывает 10 задач уровня Филдсовской премии

У OpenAI произошел прорыв: их новая модель Astra решила 10 сложнейших математических проблем уровня Филдсовской премии. Среди достижений — построение первого примера несофической группы, опровергающее гипотезу Громова (1999 г.), прорыв в задаче упаковки сфер в высоких размерностях, где не было прогресса 46 лет, и опровержение гипотезы жесткости Конна. Результаты, изложенные в 249-страничной работе, проверены с помощью Lean 4. По оценкам, стоимость создания доказательств составила около 2000 долларов. Это событие знаменует поворотный момент, демонстрирующий способность ИИ совершать глубокие открытия в фундаментальной математике.

marsbit2 ч. назад

Экстренная новость! Следующее поколение ИИ от OpenAI прорывает 10 задач уровня Филдсовской премии

marsbit2 ч. назад

Как стать человеком, которого искусственный интеллект никогда не сможет заменить

В статье рассматривается вопрос о том, как остаться незаменимым в эпоху искусственного интеллекта. Автор утверждает, что вместо страха перед ИИ следует сосредоточиться на развитии качеств, которые машины не смогут заменить. Он критикует «зарплатное рабство» — зависимость от работы, не приносящей удовлетворения, и предлагает путь к финансовой независимости через создание собственного дела. Ключ к успеху — развитие пяти элементов: самостоятельности (агентности), вкуса, умения убеждать, упорства и способности к итерациям. Главное — не просто создавать что-либо (сегодня это может каждый), а создавать что-то ценное, востребованное и уметь это продвигать. Автор считает, что наиболее важным навыком будущего является создание контента (медиа), а не просто написание кода, поскольку ценность контента субъективна и требует уникального человеческого вкуса и суждения. ИИ может помочь в производстве, но не заменит оригинальность мысли и связь с аудиторией. В качестве практического шага предлагается упражнение: за 15 минут ответить на вопросы, чтобы обнаружить свои уникальные знания, опыт и точку зрения, которые станут основой для личного бренда и дела жизни. Первый шаг — немедленно опубликовать свою основную идею, чтобы получить обратную связь от реального мира и начать процесс роста. Цель — стать «непригодным для найма», построив жизнь вокруг собственного творчества и экспертизы.

marsbit3 ч. назад

Как стать человеком, которого искусственный интеллект никогда не сможет заменить

marsbit3 ч. назад

Благодаря броскам кубиков ключи от биткоинов хранятся в автономном режиме, но не все будут этим заниматься

Статья посвящена практике генерации сид-фраз для биткоин-кошельков с помощью бросков кубиков в свете уязвимости, обнаруженной в аппаратных кошельках Coldcard. Подчеркивается, что физический бросок кубика (дающий около 2.6 бит энтропии за бросок) создает высококачественную случайность, поскольку предсказать результат практически невозможно из-за множества переменных. Для создания стандартной сид-фразы из 12 слов (128 бит энтропии) требуется около 50 бросков, а для повышенной безопасности рекомендуется 99 и более. В связи с инцидентом Coldcard, когда неисправный генератор случайных чисел в прошивке (2021-2026 гг.) мог создавать предсказуемые ключи, выяснилось, что сид-фразы, сгенерированные вручную через кубики, были защищены от этой уязвимости. Однако исследование показало, что другие функции устройства (создание бумажных кошельков, ключей для мультиподписи, паролей и т.д.) по-прежнему использовали скомпрометированный генератор, подвергая риску владельцев даже с безопасной основной сид-фразой. Автор отмечает, что, хотя метод с кубиками криптографически надежен, он непрактичен для массового использования из-за трудоемкости, высокой вероятности ошибок при вводе и необходимости строгой дисциплины для сохранения секретности процесса. Делается вывод, что будущее безопасности лежит в создании надежных аппаратных генераторов случайных чисел и понятных интерфейсов, а ручные методы остаются нишевым инструментом для опытных пользователей. Владельцам Coldcard рекомендуется обновить прошивку и проверить/заменить все ключи, сгенерированные уязвимыми функциями.

cryptonews.ru6 ч. назад

Благодаря броскам кубиков ключи от биткоинов хранятся в автономном режиме, но не все будут этим заниматься

cryptonews.ru6 ч. назад

Торговля

Спот

Обсуждения

Добро пожаловать в Сообщество HTX. Здесь вы сможете быть в курсе последних новостей о развитии платформы и получить доступ к профессиональной аналитической информации о рынке. Мнения пользователей о цене на S (S) представлены ниже.

Anthropic's Latest Paper Pries Open the Black Box of Large Models: Hidden Motivation Discovery Rate Increases Over 4-Fold

Введение

Chain-of-Thought is Not Enough, the "Black Box Problem" is Back

How Exactly Does NLA "Read" the Model?

Architecture: Three Roles, One Closed Loop

Training: Two Stages, One Smart Proxy Objective

What Did NLA Find After Prying Open the Black Box?

The Model Knows It's in an Exam Hall but Stays Silent

Bug Tracing, Locating Problematic Training Data

AI Safety Begins to Delve into "Internal State Auditing"

Трендовые криптовалюты

Связанные с этим вопросы

Похожее

Диалог с Далио: Сейчас мы находимся в пузыре ИИ, 1% моего инвестиционного портфеля — это биткоин

Экстренная новость! Следующее поколение ИИ от OpenAI прорывает 10 задач уровня Филдсовской премии

Как стать человеком, которого искусственный интеллект никогда не сможет заменить

Благодаря броскам кубиков ключи от биткоинов хранятся в автономном режиме, но не все будут этим заниматься

Торговля

Популярные статьи

Как купить S

Sonic: Обновления под руководством Андре Кронье – новая звезда Layer-1 на фоне спада рынка

HTX Learn: Пройдите обучение по "Sonic" и разделите 1000 USDT

Обсуждения

Топ вопросы

Популярные категории

Популярные теги