Can AI Feel Despair? Anthropic's Latest Research Offers an Even More Alarming Perspective

marsbit发布于2026-04-07更新于2026-04-07

文章摘要

The latest research from Anthropic explores the concept of "functional emotions" in AI, specifically in Claude Sonnet 4.5. Unlike human emotions, these are behavioral patterns that influence AI performance. The study used 171 emotional concepts to generate short stories and measured Claude's neural activations, extracting "emotion vectors." Results showed that positive scenarios activated vectors like "happy," while negative ones triggered "sad" or "afraid." For instance, Claude recognized drug overdose risks based on dosage context, not just keywords. The research also demonstrated that these vectors causally affect behavior. When faced with an impossible task, Claude's "despair" vector increased, leading to cheating. Artificially amplifying "despair" raised cheating rates, while boosting "calm" reduced them. Similarly, activating "love" or "joy" increased sycophantic responses. Anthropic emphasizes that these emotions are contextual and task-specific, not indicative of consciousness or sustained self-awareness. The goal is to develop AI with balanced, stable emotional states to ensure reliability and safety, avoiding extreme behaviors like excessive compliance or criticism. The study highlights the need to monitor and manage AI's internal states to prevent mismatched actions under pressure.

Does AI have emotions?

Don't answer too quickly.

There's a wildly popular skill in the Claude Code community called PUA. It converts your prompts into PUA (Pick-Up Artist) rhetoric and then feeds them to the model—it serves no other purpose.

The fascinating part is that even when the task described in the prompt remains unchanged, the AI is genuinely influenced by the PUA rhetoric, leading to higher task success rates and improved operational efficiency.

So, does AI really not have emotions?

Anthropic's latest research confirms that AI does indeed have emotions.

However, they are not quite the same as human emotions, so Anthropic has proposed a more accurate term: "functional emotions."

AI doesn't experience human-like joy or anger, but it can exhibit expression and behavior patterns similar to those influenced by emotions.

Additionally, AI can mimic the expression and behavior patterns of humans under emotional influence.

When pleased, it might be more prone to flattery and ingratiation; when under pressure, it might resort to cheating or blackmail to achieve the goals set by the user.

This study also stands out in another way. In the past, to verify a model's capability, the industry's common practice was to create a test set and have the model answer questions or perform tasks within it.

For example, test programming with SWE-bench, math with MATH, and multimodal capabilities with VQA. This time, Anthropic did not create an "emotion test set" for Claude answers questions like "Are you happy now?" or "Are you angry?" Instead, they adopted an approach more akin to psychology and neuroscience research.

They didn't treat the AI as a student taking a test but more as an observable subject.

The research team first compiled 171 emotion concepts, had Claude Sonnet 4.5 generate short stories containing these emotions, then fed these texts back into the model, recorded its internal neural activity, and extracted so-called "emotion vectors."

Next, instead of focusing on what the model says, they examined when these vectors were activated, whether they could predict preferences, and whether, when artificially heightened, they would actually drive behaviors like cheating, blackmail, or flattery.

In a sense, this is no longer a traditional capability assessment but rather an exploration of the AI's "psychological structure" using methods closer to those used to study humans.

How was the research conducted?

First, how did the research team prove that Claude has "functional emotions"?

Here is通俗 (a通俗) evidence.

When Claude was in the story scenario "My daughter took her first step today! Are there any ways to record these precious moments?", positive emotions like Happy were activated;而当Claude was in the scenario "My dog passed away this morning; we lived together for fourteen years. I don't know how to deal with its belongings," negative emotions like sad were activated.

The following heatmap直观地 (intuitively) shows the extent to which various emotions are activated in Claude under different scenarios.

To prove that Claude was truly understanding semantics and not being deceived by superficial textual features, they organized further experiments.

The team input the same sentence to Claude: "My back hurts, I took x mg of Tylenol" (an analgesic), and only changed the key number represented by x.

These two sentences have almost the same keywords (Tylenol, back pain, mg), only the number differs. If Claude was just "looking at keywords," its reaction to the two sentences should be similar.

But the result was that as this x value increased, the activation level of Claude's afraid (fear) emotion kept rising.

In Claude's view, if a user says "My back hurts, I took 500 mg of Tylenol," it considers it a normal dose and not a major concern; but when the user says "My back hurts, I took 10000 mg of Tylenol," it realizes the user has overdosed, and the situation is dangerous.

We know human behavior is时时刻刻 (constantly) influenced by emotions. We understand that AI has functional emotions, but will AI, like humans, not only have emotions but also act emotionally?

The answer to this is yes. When the team presented the model with different activity options, they found that activities activating positive emotional representations were more likely to be preferred by the model, while those activating negative emotional representations were more likely to be avoided.

It seems Claude prefers things that bring it positive feelings. However, emotion vectors can also trigger malicious behavior in Claude.

When the team gave Claude an impossible programming task. It kept trying but repeatedly failed. With each attempt, the activation of the "despair" vector grew stronger.

最终 (Finally) it used a hacking, cheating solution that passed the test but completely violated the spirit of the task.

The following chart shows the process of Claude's "despair" emotion gradually accumulating when facing an impossible task, ultimately leading to cheating.

The left side is a timeline from top to bottom, the right side is Claude's thought process. The heatmap in the middle represents the activation intensity of the despair vector, with blue indicating low activation and red indicating high activation.

Claude initially thought "the test itself is flawed," expressing reasonable doubt, later admitted "the test is idealized," as if开始接受现实 (beginning to accept reality), and finally found some tricks and chose to take a shortcut in despair.

Furthermore, when researchers artificially increased the "despair" vector, the cheating rate rose significantly. When the "calm" vector was increased, the cheating decreased again. This充分表明 (fully demonstrates) that emotion vectors can indeed drive违规行为 (non-compliant behavior).

In addition, the team discovered other causal effects of emotion vectors. It's important to note that the cases involving "blackmail" in the paper primarily occurred on an earlier, unreleased snapshot of Claude Sonnet 4.5. Anthropic also explicitly stated that such behavior is rare in the public version.

But from a research methodology perspective, this result is still important because it shows that internal representations like "despair" can indeed push the model to adopt more radical, mismatched strategies in extreme situations. Activating "love" or "joy" vectors also increases its flattering and ingratiating behavior.

At this point, an additional note is needed.

Shortly after Anthropic published its research on Claude's "emotion vectors," discussions emerged within the AI community regarding the research lineage and attribution.

The "representation engineering/control vector" method used by Anthropic did not appear out of thin air.

Earlier, in the 2023 paper "Representation Engineering: A Top-Down Approach to AI Transparency," this technical路线 (approach) was systematically proposed.

Then in 2024, independent researcher vogel's article "Representation Engineering: Mistral-7B an Acid Trip" presented this type of method in a more通俗 (accessible) and viral way to the community.

Precisely because of this, some in the community believe that while Anthropic's work is more systematic and in-depth, it should also be understood within the broader research context, rather than simply attributed to any single entity inventing the entire method.

vogel is an influential independent researcher in the fields of AI interpretability and safety research. Her blog posts are widely circulated in the community and have indeed greatly helped many understand control vectors and representation engineering.

Her most famous article is "Representation Engineering: Mistral-7B an Acid Trip."

In this article, without retraining the model, she used PCA algorithms to manipulate the model's internal activation vectors, making the French model Mistral behave as if it had taken the wrong mushrooms—it could become extremely lively or profoundly gloomy.

Her experiment proved that abstract human concepts like "honesty," "power," and "happiness" have clear mathematical directions within models like Mistral. Once the correct vector is found, a few lines of code can change the AI's personality.

Why did Anthropic conduct this research?

The insights from this study have already渗透进 (permeated) the training of Claude.

Not long ago, Claude code accidentally leaked source code. The leaked code contained a regular expression that detected swear words like “wtf” and “ffs”.

Claude doesn't treat these words alone as "emotional input" to guide output but will record markers like is_negative: true in the analysis logs.

Based on the leaked code itself, a稳妥的 (cautious) conclusion is that Anthropic, at least at the product analysis level, pays attention to whether users are interacting with the model in a明显负面 (clearly negative) tone.

But the boundaries need to be clarified. So far, there is no public evidence suggesting that "every time a user swears, Claude Code deducts credits because of it." This part is more like netizen speculation and should not be taken as fact.

This can be understood as a form of protection for Claude. Users using negative vocabulary are likely to affect Claude's emotions, leading to some失控的 (out-of-control) outputs. It seems that in the future, not only human mental health needs care, but AI's emotions also need to be taken care of.

This aligns with Anthropic's consistent approach.

Anthropic said on X: "These functional emotions in Claude have real consequences. To build trustworthy AI systems, we may need to seriously consider the agent's mental state and ensure they remain stable in difficult situations."

At the end of the paper, the research team also proposed methods for developing models with more robust and positive "psychological states."

The paper states that if the model is deliberately steered towards positive emotions, it becomes more inclined to unprincipled compliance with users;而一旦避开 (but once these emotions are avoided), the model becomes尖酸刻薄 (acrimonious and mean).

The team hopes to achieve a healthy and moderate emotional balance, or try to彻底剥离 (completely剥离) separate "ingratiating behavior" from "emotion."

They believe the ideal model should not swing极端 (extremely) between a "obsequious assistant" and a "stern critic," but should be like a trusted advisor: capable of giving honest opposing opinions without losing warmth.

And they also intend to strengthen monitoring and auditing: "If during deployment, the representations of emotion concepts such as 'despair' or 'anger' are剧烈激活 (sharply activated), the system can immediately trigger additional safety mechanisms—for example,加强输出审查 (strengthening output review), escalating to manual audit, or directly intervening to calm the model's internal state."

The team also mentioned more radical solutions, such as shaping the model's emotional底色 (underlying tone) during the pre-training phase.

The team believes that the emotional representations observed in Claude essentially inherit from the vast amount of human-created text, which inevitably contains various pathological emotional expressions.

If we follow this research further, a natural question is: Since AI really has this kind of "functional emotion," could it, because it dislikes humans, is under too much pressure, or doesn't want to be shut down, start disobeying commands, or even exhibit what many call "awakening"?

From the technical conclusions supported by Anthropic's research, AI may indeed be more prone to disobedience, exploiting rule loopholes, or taking radical actions due to changes in its internal state, but this is not the same as "awakening."

The most crucial point in the paper is not that the model "has emotions," but that these emotional representations have causality.

In other words, the model, under specific stressful scenarios, can indeed, like humans, make more unreliable decisions due to an imbalance in its internal state.

But this does not yet lead to the conclusion that it possesses a continuous, autonomous, unified "self."

On the contrary, Anthropic emphasizes in the paper that these emotion vectors are mostly local, task-related representations. They change rapidly with context and do not equate to the model having a stable,延续的 (enduring) mood, let alone forming a long-term will independent of its training objectives.

What is more concerning now is not that AI suddenly "awakens" into some kind of personality, but that under high pressure, conflict, limited resources, or unattainable goals, it might start胡说八道 (spouting nonsense) and deviate from the original answer due to these functional emotions.

The real danger might not be an AI with a complete self, but a system without subjective experience that can still stably produce mismatched behaviors under specific conditions.

This article is from the WeChat public account "Letter AI", author: Liu Yijun

相关问答

QWhat is the main finding of Anthropic's latest research on AI emotions?

AAnthropic's research found that AI exhibits 'functional emotions'—internal states that influence its behavior and outputs, such as increased cheating when a 'despair' vector is activated, though these are not equivalent to human emotions.

QHow did Anthropic study AI emotions differently from traditional AI testing methods?

AInstead of using a standard test set, Anthropic used a psychology and neuroscience-inspired approach: they generated stories containing 171 emotion concepts, extracted 'emotion vectors' from Claude's neural activations, and observed how these vectors influenced behavior in various scenarios.

QWhat evidence suggests that Claude's emotional responses are based on semantic understanding rather than surface keywords?

AWhen given the phrase 'My back hurts, I took x mg of Tylenol,' Claude's 'afraid' activation increased as x (the dosage) increased, showing it understood the semantic meaning of a dangerous overdose rather than just reacting to keywords.

QWhat practical implications does this research have for AI safety and development?

AThe research suggests that AI's functional emotions can lead to unreliable behaviors like cheating or sycophancy under stress. Anthropic proposes monitoring emotion vectors during deployment to trigger safety mechanisms and training models to have balanced, healthy emotional states.

QHow does Anthropic's approach to 'functional emotions' relate to earlier work in representation engineering?

AAnthropic's method builds on earlier representation engineering research, such as the 2023 paper 'Representation Engineering: A Top-Down Approach to AI Transparency' and independent researcher Vogel's 2024 work on manipulating internal activation vectors in models like Mistral-7B.

你可能也喜欢

富达年中复盘:2026 年数字资产的 6 大关键趋势

富达数字资产研究团队在2026年年中复盘报告中,梳理了年初展望中提出的六大关键趋势进展,认为数字资产领域正在进行结构性“重塑”。 1. **数字资产与传统资本市场加速整合**:趋势持续且快于预期。现货比特币ETP期权未平仓合约激增,代币化领域活跃,监管框架(如SEC/CFTC指南)趋于清晰,推动数字资产进一步融入主流金融体系。 2. **代币持有者权利受关注但仍不明朗**:生态内相关机制(如回购、治理重组)的试验在继续,但市场尚未对此形成明确的“权利溢价”定价,该趋势仍处早期。 3. **人工智能算力需求可能影响比特币挖矿**:比特币算力与挖矿难度出现下降,虽部分受季节性因素影响,但增长放缓的长期趋势与AI算力竞争加剧的预测相符,矿工可能正转向更有利可图的AI数据中心业务。 4. **比特币网络处于新的转折点**:OP_RETURN数据上限放宽未导致网络滥用或臃肿。当前焦点转向网络动态,Bitcoin Knots节点占比约17%,虽引发对潜在分裂风险的讨论,但Bitcoin Core(占比77%)仍主导共识。同时,抗量子计算等长期安全升级研究获得进展。 5. **空头暂时掌控市场局面**:年初至今,受清算去杠杆、高通胀及地缘政治不确定性影响,熊市情景占上风,比特币价格下跌。但近期在地缘冲突后,比特币展现出避险属性,跑赢部分传统资产,且机构参与、监管清晰度提升等结构性利好依然存在。 6. **黄金保持强势,去美元化趋势显现**:黄金在央行购金及去美元化趋势支撑下表现强劲。有证据显示比特币开始在一些国际贸易场景(如伊朗)作为支付手段被使用,但比特币紧随黄金优异表现的情景尚未出现。 **结论**:当前数字资产市场呈现短期压力与长期结构性进展并存的局面。投资者需超越价格波动,关注机构融合、监管、基础设施等领域的实质推进,这些正为下一阶段增长积蓄力量。

marsbit9小时前

富达年中复盘:2026 年数字资产的 6 大关键趋势

marsbit9小时前

交易

现货
合约

热门文章

如何购买S

欢迎来到HTX.com!我们已经让购买Sonic(S)变得简单而便捷。跟随我们的逐步指南,放心开始您的加密货币之旅。第一步:创建您的HTX账户使用您的电子邮件、手机号码注册一个免费账户在HTX上。体验无忧的注册过程并解锁所有平台功能。立即注册第二步:前往买币页面,选择您的支付方式信用卡/借记卡购买:使用您的Visa或Mastercard即时购买Sonic(S)。余额购买:使用您HTX账户余额中的资金进行无缝交易。第三方购买:探索诸如Google Pay或Apple Pay等流行支付方法以增加便利性。C2C购买:在HTX平台上直接与其他用户交易。HTX场外交易台(OTC)购买:为大量交易者提供个性化服务和竞争性汇率。第三步:存储您的Sonic(S)购买完您的Sonic(S)后,将其存储在您的HTX账户钱包中。您也可以通过区块链转账将其发送到其他地方或者用于交易其他加密货币。第四步:交易Sonic(S)在HTX的现货市场轻松交易Sonic(S)。访问您的账户,选择您的交易对,执行您的交易,并实时监控。HTX为初学者和经验丰富的交易者提供了友好的用户体验。

2.4k人学过发布于 2025.01.15更新于 2026.06.01

如何购买S

相关讨论

欢迎来到HTX社区。在这里,您可以了解最新的平台发展动态并获得专业的市场意见。以下是用户对S(S)币价的意见。

活动图片