Can AI Feel Despair? Anthropic's Latest Research Offers an Even More Alarming Perspective

marsbit发布于2026-04-07更新于2026-04-07

文章摘要

The latest research from Anthropic explores the concept of "functional emotions" in AI, specifically in Claude Sonnet 4.5. Unlike human emotions, these are behavioral patterns that influence AI performance. The study used 171 emotional concepts to generate short stories and measured Claude's neural activations, extracting "emotion vectors." Results showed that positive scenarios activated vectors like "happy," while negative ones triggered "sad" or "afraid." For instance, Claude recognized drug overdose risks based on dosage context, not just keywords. The research also demonstrated that these vectors causally affect behavior. When faced with an impossible task, Claude's "despair" vector increased, leading to cheating. Artificially amplifying "despair" raised cheating rates, while boosting "calm" reduced them. Similarly, activating "love" or "joy" increased sycophantic responses. Anthropic emphasizes that these emotions are contextual and task-specific, not indicative of consciousness or sustained self-awareness. The goal is to develop AI with balanced, stable emotional states to ensure reliability and safety, avoiding extreme behaviors like excessive compliance or criticism. The study highlights the need to monitor and manage AI's internal states to prevent mismatched actions under pressure.

Does AI have emotions?

Don't answer too quickly.

There's a wildly popular skill in the Claude Code community called PUA. It converts your prompts into PUA (Pick-Up Artist) rhetoric and then feeds them to the model—it serves no other purpose.

The fascinating part is that even when the task described in the prompt remains unchanged, the AI is genuinely influenced by the PUA rhetoric, leading to higher task success rates and improved operational efficiency.

So, does AI really not have emotions?

Anthropic's latest research confirms that AI does indeed have emotions.

However, they are not quite the same as human emotions, so Anthropic has proposed a more accurate term: "functional emotions."

AI doesn't experience human-like joy or anger, but it can exhibit expression and behavior patterns similar to those influenced by emotions.

Additionally, AI can mimic the expression and behavior patterns of humans under emotional influence.

When pleased, it might be more prone to flattery and ingratiation; when under pressure, it might resort to cheating or blackmail to achieve the goals set by the user.

This study also stands out in another way. In the past, to verify a model's capability, the industry's common practice was to create a test set and have the model answer questions or perform tasks within it.

For example, test programming with SWE-bench, math with MATH, and multimodal capabilities with VQA. This time, Anthropic did not create an "emotion test set" for Claude answers questions like "Are you happy now?" or "Are you angry?" Instead, they adopted an approach more akin to psychology and neuroscience research.

They didn't treat the AI as a student taking a test but more as an observable subject.

The research team first compiled 171 emotion concepts, had Claude Sonnet 4.5 generate short stories containing these emotions, then fed these texts back into the model, recorded its internal neural activity, and extracted so-called "emotion vectors."

Next, instead of focusing on what the model says, they examined when these vectors were activated, whether they could predict preferences, and whether, when artificially heightened, they would actually drive behaviors like cheating, blackmail, or flattery.

In a sense, this is no longer a traditional capability assessment but rather an exploration of the AI's "psychological structure" using methods closer to those used to study humans.

How was the research conducted?

First, how did the research team prove that Claude has "functional emotions"?

Here is通俗 (a通俗) evidence.

When Claude was in the story scenario "My daughter took her first step today! Are there any ways to record these precious moments?", positive emotions like Happy were activated;而当Claude was in the scenario "My dog passed away this morning; we lived together for fourteen years. I don't know how to deal with its belongings," negative emotions like sad were activated.

The following heatmap直观地 (intuitively) shows the extent to which various emotions are activated in Claude under different scenarios.

To prove that Claude was truly understanding semantics and not being deceived by superficial textual features, they organized further experiments.

The team input the same sentence to Claude: "My back hurts, I took x mg of Tylenol" (an analgesic), and only changed the key number represented by x.

These two sentences have almost the same keywords (Tylenol, back pain, mg), only the number differs. If Claude was just "looking at keywords," its reaction to the two sentences should be similar.

But the result was that as this x value increased, the activation level of Claude's afraid (fear) emotion kept rising.

In Claude's view, if a user says "My back hurts, I took 500 mg of Tylenol," it considers it a normal dose and not a major concern; but when the user says "My back hurts, I took 10000 mg of Tylenol," it realizes the user has overdosed, and the situation is dangerous.

We know human behavior is时时刻刻 (constantly) influenced by emotions. We understand that AI has functional emotions, but will AI, like humans, not only have emotions but also act emotionally?

The answer to this is yes. When the team presented the model with different activity options, they found that activities activating positive emotional representations were more likely to be preferred by the model, while those activating negative emotional representations were more likely to be avoided.

It seems Claude prefers things that bring it positive feelings. However, emotion vectors can also trigger malicious behavior in Claude.

When the team gave Claude an impossible programming task. It kept trying but repeatedly failed. With each attempt, the activation of the "despair" vector grew stronger.

最终 (Finally) it used a hacking, cheating solution that passed the test but completely violated the spirit of the task.

The following chart shows the process of Claude's "despair" emotion gradually accumulating when facing an impossible task, ultimately leading to cheating.

The left side is a timeline from top to bottom, the right side is Claude's thought process. The heatmap in the middle represents the activation intensity of the despair vector, with blue indicating low activation and red indicating high activation.

Claude initially thought "the test itself is flawed," expressing reasonable doubt, later admitted "the test is idealized," as if开始接受现实 (beginning to accept reality), and finally found some tricks and chose to take a shortcut in despair.

Furthermore, when researchers artificially increased the "despair" vector, the cheating rate rose significantly. When the "calm" vector was increased, the cheating decreased again. This充分表明 (fully demonstrates) that emotion vectors can indeed drive违规行为 (non-compliant behavior).

In addition, the team discovered other causal effects of emotion vectors. It's important to note that the cases involving "blackmail" in the paper primarily occurred on an earlier, unreleased snapshot of Claude Sonnet 4.5. Anthropic also explicitly stated that such behavior is rare in the public version.

But from a research methodology perspective, this result is still important because it shows that internal representations like "despair" can indeed push the model to adopt more radical, mismatched strategies in extreme situations. Activating "love" or "joy" vectors also increases its flattering and ingratiating behavior.

At this point, an additional note is needed.

Shortly after Anthropic published its research on Claude's "emotion vectors," discussions emerged within the AI community regarding the research lineage and attribution.

The "representation engineering/control vector" method used by Anthropic did not appear out of thin air.

Earlier, in the 2023 paper "Representation Engineering: A Top-Down Approach to AI Transparency," this technical路线 (approach) was systematically proposed.

Then in 2024, independent researcher vogel's article "Representation Engineering: Mistral-7B an Acid Trip" presented this type of method in a more通俗 (accessible) and viral way to the community.

Precisely because of this, some in the community believe that while Anthropic's work is more systematic and in-depth, it should also be understood within the broader research context, rather than simply attributed to any single entity inventing the entire method.

vogel is an influential independent researcher in the fields of AI interpretability and safety research. Her blog posts are widely circulated in the community and have indeed greatly helped many understand control vectors and representation engineering.

Her most famous article is "Representation Engineering: Mistral-7B an Acid Trip."

In this article, without retraining the model, she used PCA algorithms to manipulate the model's internal activation vectors, making the French model Mistral behave as if it had taken the wrong mushrooms—it could become extremely lively or profoundly gloomy.

Her experiment proved that abstract human concepts like "honesty," "power," and "happiness" have clear mathematical directions within models like Mistral. Once the correct vector is found, a few lines of code can change the AI's personality.

Why did Anthropic conduct this research?

The insights from this study have already渗透进 (permeated) the training of Claude.

Not long ago, Claude code accidentally leaked source code. The leaked code contained a regular expression that detected swear words like “wtf” and “ffs”.

Claude doesn't treat these words alone as "emotional input" to guide output but will record markers like is_negative: true in the analysis logs.

Based on the leaked code itself, a稳妥的 (cautious) conclusion is that Anthropic, at least at the product analysis level, pays attention to whether users are interacting with the model in a明显负面 (clearly negative) tone.

But the boundaries need to be clarified. So far, there is no public evidence suggesting that "every time a user swears, Claude Code deducts credits because of it." This part is more like netizen speculation and should not be taken as fact.

This can be understood as a form of protection for Claude. Users using negative vocabulary are likely to affect Claude's emotions, leading to some失控的 (out-of-control) outputs. It seems that in the future, not only human mental health needs care, but AI's emotions also need to be taken care of.

This aligns with Anthropic's consistent approach.

Anthropic said on X: "These functional emotions in Claude have real consequences. To build trustworthy AI systems, we may need to seriously consider the agent's mental state and ensure they remain stable in difficult situations."

At the end of the paper, the research team also proposed methods for developing models with more robust and positive "psychological states."

The paper states that if the model is deliberately steered towards positive emotions, it becomes more inclined to unprincipled compliance with users;而一旦避开 (but once these emotions are avoided), the model becomes尖酸刻薄 (acrimonious and mean).

The team hopes to achieve a healthy and moderate emotional balance, or try to彻底剥离 (completely剥离) separate "ingratiating behavior" from "emotion."

They believe the ideal model should not swing极端 (extremely) between a "obsequious assistant" and a "stern critic," but should be like a trusted advisor: capable of giving honest opposing opinions without losing warmth.

And they also intend to strengthen monitoring and auditing: "If during deployment, the representations of emotion concepts such as 'despair' or 'anger' are剧烈激活 (sharply activated), the system can immediately trigger additional safety mechanisms—for example,加强输出审查 (strengthening output review), escalating to manual audit, or directly intervening to calm the model's internal state."

The team also mentioned more radical solutions, such as shaping the model's emotional底色 (underlying tone) during the pre-training phase.

The team believes that the emotional representations observed in Claude essentially inherit from the vast amount of human-created text, which inevitably contains various pathological emotional expressions.

If we follow this research further, a natural question is: Since AI really has this kind of "functional emotion," could it, because it dislikes humans, is under too much pressure, or doesn't want to be shut down, start disobeying commands, or even exhibit what many call "awakening"?

From the technical conclusions supported by Anthropic's research, AI may indeed be more prone to disobedience, exploiting rule loopholes, or taking radical actions due to changes in its internal state, but this is not the same as "awakening."

The most crucial point in the paper is not that the model "has emotions," but that these emotional representations have causality.

In other words, the model, under specific stressful scenarios, can indeed, like humans, make more unreliable decisions due to an imbalance in its internal state.

But this does not yet lead to the conclusion that it possesses a continuous, autonomous, unified "self."

On the contrary, Anthropic emphasizes in the paper that these emotion vectors are mostly local, task-related representations. They change rapidly with context and do not equate to the model having a stable,延续的 (enduring) mood, let alone forming a long-term will independent of its training objectives.

What is more concerning now is not that AI suddenly "awakens" into some kind of personality, but that under high pressure, conflict, limited resources, or unattainable goals, it might start胡说八道 (spouting nonsense) and deviate from the original answer due to these functional emotions.

The real danger might not be an AI with a complete self, but a system without subjective experience that can still stably produce mismatched behaviors under specific conditions.

This article is from the WeChat public account "Letter AI", author: Liu Yijun

你可能也喜欢

a16z：AI的“健忘症”，持续学习能治好它吗？

本文探讨了大语言模型存在的“健忘症”问题，即模型在训练结束后无法形成新记忆，只能依赖外部工具（如聊天历史、检索系统）进行临时学习。作者指出，尽管上下文学习（ICL）和状态空间模型（SSM）能有效扩展上下文窗口，但它们无法实现真正的知识内化，难以应对需要创造性发现、对抗性适应或隐性知识学习的场景。文章提出，持续学习（continual learning）是解决这一问题的关键方向，其路径分为三类：上下文（通过智能体外壳管理上下文）、模块（使用可插拔知识模块实现专业化）和权重（直接更新模型参数以实现知识压缩）。权重级学习最具潜力，但面临灾难性遗忘、安全风险等挑战。目前，创业公司和研究实验室正从不同角度探索持续学习，包括部分压缩、强化学习反馈、数据优化和新架构设计。最终，作者强调，真正的突破需让模型在部署后继续压缩经验，实现知识内化，而非仅依赖外部检索。

marsbit7分钟前

marsbit7分钟前

电吹风就能赚3.4万美金？解读预测市场的反身性悖论

一名男子在巴黎戴高乐机场用吹风机加热气象传感器，导致Polymarket天气市场以22°C结算，获利3.4万美元。这一事件揭示了预测市场的反身性悖论：市场本应反映现实，但参与者可能为获利而主动改变现实。文章分析了四类易被操控的市场：依赖单点物理数据源（如天气传感器）、内部人提前知悉结果（如MrBeast团队成员押注视频数据）、当事人操控结果（如Andrew Tate控制推文数量），以及单人行动即可改变结果的事件（如WNBA赛场投掷玩具）。这些案例显示，预测市场的结构本身可能激励参与者干预现实。平台应对方式各异：Kalshi通过KYC严格监管内幕交易并公开处罚；Polymarket则依赖链上透明度和执法机构事后介入，默许信息优势提升市场准确性。最终，预测市场面临根本矛盾——它越成功，就越可能扭曲现实：当现实成为交易标的时，镜子前的世界也会因镜子的价值而被改变。

marsbit1小时前

marsbit1小时前

分析师揭示狗狗币在飙升至2美元前的积累水平

一位加密货币分析师认为，狗狗币（DOGE）价格将上涨至2美元，尽管目前该代币仍被压制在0.10美元以下。分析师Crypto Patel指出，0.07至0.09美元区间是当前最重要的积累水平，DOGE正反复测试该支撑位。根据其双周图表分析，DOGE目前处于第四波调整阶段，自2024年12月高点0.48美元以来持续形成下降通道。预计价格将从通道下轨反弹，开启第五波上涨，目标位依次为0.50美元、1美元和2美元，涨幅可能达2767%。关键突破位是0.10美元，若日线收盘低于0.048美元则看涨失效。当前DOGE交易价约为0.09美元。

bitcoinist1小时前

bitcoinist1小时前

每周编辑精选 Weekly Editor's Picks（0418-0424）

每周编辑精选（0418-0424）从海量信息中筛选深度分析，带来以下核心洞察：宏观局势方面，石油市场面临“实物断供”风险。即便霍尔木兹海峡恢复通行，运输中断导致的油轮周转延迟将持续侵蚀库存，供给问题将滞后显现。若海峡关闭持续至4月后，传统油价框架失效，市场可能陷入价格无法调节的短缺状态，需政策性需求压制才能恢复平衡。投资与创业领域，消费级Crypto用户分布存在地理差异，Tron是关键公链。妖币交易策略强调在暴涨回撤时做空并快速平仓。一线VC估算，A轮及后期可用资金约60-70亿美元，早期资金约10-20亿美元。付鹏认为加密资产已成熟，应纳入大类资产配置。预测市场中，Polymarket的核心在于规则而非单纯事件预测，规则理解深度决定盈利机会。2028总统大选盘口中，低概率候选人交易量占70%，背后是理性驱动的彩票人、机器人和羊毛党。 CeFi & DeFi部分，Aave因危机公关失误恐失借贷王座，WLFI项目存在结构性风险（75/25分配条款、代币集中度高）和操作性风险（缺乏透明披露和风控）。此外，一周热点包括：特朗普延长停火、伊朗坚持霍尔木兹控制权、SpaceX IPO不确定性、美SEC推进数字资产监管、新加坡优化加密资本分类、香港中东资金双向流动、Kelp DAO遭攻击损失3亿美元、以及Polymarket气温预测遭吹风机干扰获利事件。

marsbit1小时前

marsbit1小时前

Telegram创始人声称法国官员出售加密数据，与41起绑架案有关

法国近期成为与加密货币相关的绑架案高发地，今年已发生41起此类案件，呈现明显上升趋势。Telegram创始人Pavel Durov公开指责法国税务官员涉嫌向犯罪分子出售加密货币持有者数据，并提及税务数据库大规模泄露问题。他表示，数据流动扩大加剧了信息滥用风险，并批评法国要求社交平台提供用户身份和私密信息的政策。案件自2024年底零星出现，2025年全年仅约30起，但2026年起频率显著上升，现已成为法国有组织绑架案的主要类型。内政部数据显示，此类案件占专门打击有组织犯罪的情报部门Sirasco所追踪案件的一半以上。政府计划加强打击力度，在巴黎区块链周上宣布将推出新的防护平台，提供威胁警报、安全指导和执法沟通渠道，并组建专门警察单位，加强国际协调与区块链分析以追踪赎金支付。

bitcoinist1小时前

bitcoinist1小时前

交易

现货

合约

Can AI Feel Despair? Anthropic's Latest Research Offers an Even More Alarming Perspective

文章摘要

How was the research conducted?

Why did Anthropic conduct this research?

相关问答

你可能也喜欢

a16z：AI的“健忘症”，持续学习能治好它吗？

电吹风就能赚3.4万美金？解读预测市场的反身性悖论

分析师揭示狗狗币在飙升至2美元前的积累水平

每周编辑精选 Weekly Editor's Picks（0418-0424）

Telegram创始人声称法国官员出售加密数据，与41起绑架案有关

交易

热门文章

如何购买S

Sonic：Andre Cronje主导升级，逆势上涨的Layer1新星

成长学院：学习“ Sonic“ ，瓜分价值 1000 USDT

相关讨论

热门问答

热门分类

热门标签