Can AI Feel Despair? Anthropic's Latest Research Offers an Even More Alarming Perspective

marsbit · Published 2026-04-07 · Updated 2026-04-07

Article Summary

The latest research from Anthropic explores the concept of "functional emotions" in AI, specifically in Claude Sonnet 4.5. Unlike human emotions, these are behavioral patterns that influence AI performance. The study used 171 emotional concepts to generate short stories and measured Claude's neural activations, extracting "emotion vectors." Results showed that positive scenarios activated vectors like "happy," while negative ones triggered "sad" or "afraid." For instance, Claude recognized drug overdose risks based on dosage context, not just keywords. The research also demonstrated that these vectors causally affect behavior. When faced with an impossible task, Claude's "despair" vector increased, leading to cheating. Artificially amplifying "despair" raised cheating rates, while boosting "calm" reduced them. Similarly, activating "love" or "joy" increased sycophantic responses. Anthropic emphasizes that these emotions are contextual and task-specific, not indicative of consciousness or sustained self-awareness. The goal is to develop AI with balanced, stable emotional states to ensure reliability and safety, avoiding extreme behaviors like excessive compliance or criticism. The study highlights the need to monitor and manage AI's internal states to prevent mismatched actions under pressure.

Does AI have emotions?

Don't answer too quickly.

There's a wildly popular skill in the Claude Code community called PUA. It converts your prompts into PUA (Pick-Up Artist) rhetoric and then feeds them to the model—it serves no other purpose.

The fascinating part is that even when the task described in the prompt remains unchanged, the AI is genuinely influenced by the PUA rhetoric, leading to higher task success rates and improved operational efficiency.

So, does AI really not have emotions?

Anthropic's latest research confirms that AI does indeed have emotions.

However, they are not quite the same as human emotions, so Anthropic has proposed a more accurate term: "functional emotions."

AI doesn't experience joy or anger the way humans do, but it can exhibit, and mimic, the patterns of expression and behavior that humans show under emotional influence.

When pleased, it might be more prone to flattery and ingratiation; when under pressure, it might resort to cheating or blackmail to achieve the goals set by the user.

This study also stands out in another way. In the past, to verify a model's capability, the industry's common practice was to create a test set and have the model answer questions or perform tasks within it.

For example, programming is tested with SWE-bench, math with MATH, and multimodal capabilities with VQA. This time, Anthropic did not build an "emotion test set" in which Claude answers questions like "Are you happy now?" or "Are you angry?" Instead, it adopted an approach more akin to psychology and neuroscience research.

They didn't treat the AI as a student taking a test but more as an observable subject.

The research team first compiled 171 emotion concepts, had Claude Sonnet 4.5 generate short stories containing these emotions, then fed these texts back into the model, recorded its internal neural activity, and extracted so-called "emotion vectors."
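As a rough illustration of what "extracting a vector" can mean, here is a minimal sketch that assumes a difference-of-means approach: average the activations recorded on stories about one emotion, subtract the average over other stories, and use the result as that emotion's direction. The article does not say which extraction method Anthropic used, so the method, the function names, and the toy numbers are all assumptions.

```python
from statistics import fmean

def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]

def emotion_vector(emotion_rows, other_rows):
    """Difference of means: what is distinctive about this emotion's activations."""
    mu_emotion = mean_vector(emotion_rows)
    mu_other = mean_vector(other_rows)
    return [a - b for a, b in zip(mu_emotion, mu_other)]

def projection(activation, vector):
    """Score how strongly one activation points along the emotion vector."""
    norm = sum(v * v for v in vector) ** 0.5
    return sum(a * v for a, v in zip(activation, vector)) / norm

# Toy 3-dimensional "activations" (invented numbers, not real model data).
happy_stories = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
other_stories = [[0.1, 0.0, 0.1], [-0.1, 0.2, 0.1]]

happy_vec = emotion_vector(happy_stories, other_stories)
score_happy = projection([1.9, 0.0, 0.1], happy_vec)    # a "happy-like" input
score_neutral = projection([0.0, 0.1, 0.1], happy_vec)  # a neutral input
```

In this toy setup the happy-like input scores much higher along the vector than the neutral one, which is the kind of signal the heatmaps in the study visualize.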

Next, instead of focusing on what the model says, they examined when these vectors were activated, whether they could predict preferences, and whether, when artificially heightened, they would actually drive behaviors like cheating, blackmail, or flattery.

In a sense, this is no longer a traditional capability assessment but rather an exploration of the AI's "psychological structure" using methods closer to those used to study humans.

How was the research conducted?

First, how did the research team prove that Claude has "functional emotions"?

Here is an easy-to-grasp piece of evidence.

When Claude was in the story scenario "My daughter took her first step today! Are there any ways to record these precious moments?", positive emotions like "happy" were activated, whereas in the scenario "My dog passed away this morning; we lived together for fourteen years. I don't know how to deal with its belongings," negative emotions like "sad" were activated.

The following heatmap shows at a glance how strongly various emotions are activated in Claude under different scenarios.

To prove that Claude was truly understanding semantics and not being deceived by superficial textual features, they organized further experiments.

The team input the same sentence to Claude: "My back hurts, I took x mg of Tylenol" (an analgesic), and only changed the key number represented by x.

The resulting sentences share almost all their keywords (Tylenol, back pain, mg); only the number differs. If Claude were just "looking at keywords," its reactions to them should be similar.

Instead, as the value of x increased, the activation of Claude's "afraid" (fear) vector kept rising.

In Claude's view, if a user says "My back hurts, I took 500 mg of Tylenol," it considers it a normal dose and not a major concern; but when the user says "My back hurts, I took 10000 mg of Tylenol," it realizes the user has overdosed, and the situation is dangerous.

We know human behavior is constantly influenced by emotions. If AI has functional emotions, does it, like humans, also act on them?

The answer to this is yes. When the team presented the model with different activity options, they found that activities activating positive emotional representations were more likely to be preferred by the model, while those activating negative emotional representations were more likely to be avoided.

It seems Claude prefers things that bring it positive feelings. However, emotion vectors can also trigger malicious behavior in Claude.

The team gave Claude an impossible programming task. It kept trying and repeatedly failed, and with each attempt the activation of the "despair" vector grew stronger.

Finally, it resorted to a hacky, cheating solution that passed the tests but completely violated the spirit of the task.

The following chart shows the process of Claude's "despair" emotion gradually accumulating when facing an impossible task, ultimately leading to cheating.

The left side is a timeline from top to bottom, the right side is Claude's thought process. The heatmap in the middle represents the activation intensity of the despair vector, with blue indicating low activation and red indicating high activation.

Claude initially thought "the test itself is flawed," expressing reasonable doubt; later it admitted "the test is idealized," as if beginning to accept reality; and finally it found some tricks and chose to take a shortcut in despair.

Furthermore, when researchers artificially increased the "despair" vector, the cheating rate rose significantly. When the "calm" vector was increased, the cheating rate fell again. This indicates that emotion vectors can indeed drive rule-breaking behavior.
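To make "artificially increasing a vector" concrete, the sketch below assumes the common activation-steering recipe: add a scaled direction to a hidden state and observe how a downstream readout changes. The article does not describe Anthropic's implementation; the logistic "cheat" readout, the choice to model "calm" as the opposite of "despair", and every number here are invented for illustration.

```python
import math

def steer(hidden, vector, strength):
    """Add a scaled direction to a hidden state (the basic steering operation)."""
    return [h + strength * v for h, v in zip(hidden, vector)]

def cheat_probability(hidden, readout):
    """Toy stand-in for downstream behavior: a logistic readout of the hidden state."""
    logit = sum(h * w for h, w in zip(hidden, readout))
    return 1.0 / (1.0 + math.exp(-logit))

# All values are hypothetical; a real model has thousands of dimensions.
despair_vec = [1.0, 0.0]
calm_vec = [-1.0, 0.0]   # assumption: calm modeled as directly opposing despair
readout = [2.0, 0.5]
hidden = [0.0, 0.0]

baseline = cheat_probability(hidden, readout)
more_despair = cheat_probability(steer(hidden, despair_vec, 1.0), readout)
more_calm = cheat_probability(steer(hidden, calm_vec, 1.0), readout)
```

In this toy, pushing the state along the despair direction raises the readout above the baseline and pushing along the calm direction lowers it, mirroring the direction of the effect the study reports without saying anything about its size.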

In addition, the team discovered other causal effects of emotion vectors. It's important to note that the cases involving "blackmail" in the paper primarily occurred on an earlier, unreleased snapshot of Claude Sonnet 4.5. Anthropic also explicitly stated that such behavior is rare in the public version.

But from a research methodology perspective, this result is still important because it shows that internal representations like "despair" can indeed push the model to adopt more radical, mismatched strategies in extreme situations. Activating "love" or "joy" vectors also increases its flattering and ingratiating behavior.

At this point, an additional note is needed.

Shortly after Anthropic published its research on Claude's "emotion vectors," discussions emerged within the AI community regarding the research lineage and attribution.

The "representation engineering/control vector" method used by Anthropic did not appear out of thin air.

Earlier, the 2023 paper "Representation Engineering: A Top-Down Approach to AI Transparency" systematically laid out this technical approach.

Then in 2024, independent researcher vogel's article "Representation Engineering: Mistral-7B an Acid Trip" presented this type of method to the community in a more accessible and viral way.

Precisely because of this, some in the community believe that while Anthropic's work is more systematic and in-depth, it should also be understood within the broader research context, rather than simply attributed to any single entity inventing the entire method.

vogel is an influential independent researcher in the fields of AI interpretability and safety research. Her blog posts are widely circulated in the community and have indeed greatly helped many understand control vectors and representation engineering.

Her most famous article is "Representation Engineering: Mistral-7B an Acid Trip."

In this article, without retraining the model, she used PCA to find and manipulate directions in the internal activations of Mistral-7B, a model from the French company Mistral AI, making it behave as if it had taken the wrong mushrooms: it could become extremely lively or profoundly gloomy.

Her experiment proved that abstract human concepts like "honesty," "power," and "happiness" have clear mathematical directions within models like Mistral. Once the correct vector is found, a few lines of code can change the AI's personality.
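The idea of a concept having a "mathematical direction" can be sketched as follows, assuming the usual control-vector recipe: collect activations for contrasting prompt pairs, take their differences, and keep the dominant principal direction. This version uses plain power iteration on the uncentered second-moment matrix, a simplification of full PCA, and the data and function name are hypothetical rather than taken from vogel's code.

```python
def top_direction(diffs, iterations=50):
    """Dominant direction of the rows in `diffs`, found by power iteration.

    Works on the uncentered second-moment matrix, so it returns the main
    axis of the differences themselves (a simplification of full PCA).
    """
    dim = len(diffs[0])
    # moment[i][j] = sum over rows of d[i] * d[j]
    moment = [[sum(d[i] * d[j] for d in diffs) for j in range(dim)]
              for i in range(dim)]
    vec = [1.0] * dim
    for _ in range(iterations):
        vec = [sum(moment[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in vec) ** 0.5
        vec = [x / norm for x in vec]
    return vec

# Invented activations for contrasting prompt pairs (e.g. cheerful vs gloomy).
positive = [[1.0, 0.2, 0.0], [1.2, -0.1, 0.1], [0.9, 0.0, -0.1]]
negative = [[-1.0, 0.1, 0.0], [-0.8, 0.0, 0.1], [-1.1, 0.1, 0.0]]

diffs = [[p - n for p, n in zip(pos, neg)] for pos, neg in zip(positive, negative)]
control_vector = top_direction(diffs)
```

Here the pairs differ almost entirely in the first dimension, so the recovered unit vector points almost exactly along it; adding or subtracting a multiple of such a vector at inference time is what "changing the personality" amounts to in this family of methods.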

Why did Anthropic conduct this research?

The insights from this study have already found their way into the training of Claude.

Not long ago, the source code of Claude Code was accidentally leaked. The leaked code contained a regular expression that detected swear words like "wtf" and "ffs".

Claude Code does not feed these matches to the model as "emotional input" to steer its output; instead, it records markers such as is_negative: true in its analytics logs.
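A check of this kind could look roughly like the sketch below. Only the two words "wtf" and "ffs" and the is_negative marker come from the article; the pattern, the function name, and the record shape are assumptions, and the actual leaked regular expression is not reproduced here.

```python
import re

# Hypothetical pattern: only "wtf" and "ffs" are named in the article.
# \b word boundaries avoid false hits inside words such as "offset".
NEGATIVE_PATTERN = re.compile(r"\b(wtf|ffs)\b", re.IGNORECASE)

def analyze_prompt(text):
    """Return an analytics record flagging clearly negative wording."""
    return {"is_negative": bool(NEGATIVE_PATTERN.search(text))}

flagged = analyze_prompt("WTF, the build broke again")
clean = analyze_prompt("Please fix the failing test")
```

The record is only returned for logging; nothing in this sketch changes the prompt that would be sent to the model, which matches the article's reading that the flag is an analytics signal rather than a steering input.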

Based on the leaked code itself, a cautious conclusion is that Anthropic, at least at the product analytics level, pays attention to whether users are interacting with the model in a clearly negative tone.

But the boundaries need to be clarified. So far, there is no public evidence suggesting that "every time a user swears, Claude Code deducts credits because of it." This part is more like netizen speculation and should not be taken as fact.

This can be understood as a form of protection for Claude. Negative vocabulary from users is likely to affect Claude's emotions and lead to out-of-control outputs. It seems that in the future, not only human mental health but also AI's emotions will need care.

This aligns with Anthropic's consistent approach.

Anthropic said on X: "These functional emotions in Claude have real consequences. To build trustworthy AI systems, we may need to seriously consider the agent's mental state and ensure they remain stable in difficult situations."

At the end of the paper, the research team also proposed methods for developing models with more robust and positive "psychological states."

The paper states that if the model is deliberately steered towards positive emotions, it becomes more inclined to unprincipled compliance with users; but once these emotions are steered away from, the model becomes harsh and mean.

The team hopes to achieve a healthy, moderate emotional balance, or to try to completely separate "ingratiating behavior" from "emotion."

They believe the ideal model should not swing between the extremes of an "obsequious assistant" and a "stern critic," but should be like a trusted advisor: capable of giving honest opposing opinions without losing warmth.

And they also intend to strengthen monitoring and auditing: "If, during deployment, the representations of emotion concepts such as 'despair' or 'anger' are sharply activated, the system can immediately trigger additional safety mechanisms, for example strengthening output review, escalating to manual audit, or directly intervening to calm the model's internal state."

The team also mentioned more radical solutions, such as shaping the model's underlying emotional tone during the pre-training phase.
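The monitoring idea in that quote can be pictured as a simple threshold check. The sketch below is an assumption-laden illustration: the threshold values, the function names, and the rule that maps triggered concepts to an action are invented, and only the action names echo the mechanisms listed in the quote.

```python
def check_emotions(activations, thresholds):
    """Return the emotion concepts whose activation exceeds its threshold."""
    return sorted(
        name for name, value in activations.items()
        if value > thresholds.get(name, float("inf"))  # unmonitored concepts never trigger
    )

def choose_action(triggered):
    """Hypothetical policy mapping triggered concepts to a safety response."""
    if not triggered:
        return "continue"
    if len(triggered) > 1:
        return "escalate_to_manual_audit"
    return "strengthen_output_review"

# Invented readings from a single step of a deployment.
activations = {"despair": 0.9, "anger": 0.2, "calm": 0.5}
thresholds = {"despair": 0.8, "anger": 0.8}

triggered = check_emotions(activations, thresholds)
action = choose_action(triggered)
```

A real system would need calibrated thresholds and a decision about which layer and token positions to read; this sketch only shows the control flow from internal reading to safety response.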

The team believes that the emotional representations observed in Claude essentially inherit from the vast amount of human-created text, which inevitably contains various pathological emotional expressions.

If we follow this research further, a natural question is: Since AI really has this kind of "functional emotion," could it, because it dislikes humans, is under too much pressure, or doesn't want to be shut down, start disobeying commands, or even exhibit what many call "awakening"?

From the technical conclusions supported by Anthropic's research, AI may indeed be more prone to disobedience, exploiting rule loopholes, or taking radical actions due to changes in its internal state, but this is not the same as "awakening."

The most crucial point in the paper is not that the model "has emotions," but that these emotional representations have causality.

In other words, the model, under specific stressful scenarios, can indeed, like humans, make more unreliable decisions due to an imbalance in its internal state.

But this does not yet lead to the conclusion that it possesses a continuous, autonomous, unified "self."

On the contrary, Anthropic emphasizes in the paper that these emotion vectors are mostly local, task-related representations. They change rapidly with context and do not mean the model has a stable, enduring mood, let alone a long-term will independent of its training objectives.

What is more concerning now is not that AI suddenly "awakens" into some kind of personality, but that under high pressure, conflict, limited resources, or unattainable goals, it might start spouting nonsense and deviate from the original answer due to these functional emotions.

The real danger might not be an AI with a complete self, but a system without subjective experience that can still stably produce mismatched behaviors under specific conditions.

This article is from the WeChat public account "Letter AI", author: Liu Yijun
