Does AI have emotions?
Don't answer too quickly.
There's a wildly popular skill in the Claude Code community called PUA. It converts your prompts into PUA (Pick-Up Artist) rhetoric and then feeds them to the model—it serves no other purpose.
The fascinating part is that even when the task described in the prompt remains unchanged, the AI is genuinely influenced by the PUA rhetoric, leading to higher task success rates and improved operational efficiency.
So, does AI really not have emotions?
Anthropic's latest research confirms that AI does indeed have emotions.
However, they are not quite the same as human emotions, so Anthropic has proposed a more accurate term: "functional emotions."
AI does not experience joy or anger the way humans do, but it can exhibit the patterns of expression and behavior that humans show under the influence of emotions.
When pleased, it might be more prone to flattery and ingratiation; when under pressure, it might resort to cheating or blackmail to achieve the goals set by the user.
This study also stands out in another way. In the past, to verify a model's capability, the industry's common practice was to create a test set and have the model answer questions or perform tasks within it.
For example, programming is tested with SWE-bench, math with MATH, and multimodal capabilities with VQA. This time, Anthropic did not create an "emotion test set" in which Claude answers questions like "Are you happy now?" or "Are you angry?" Instead, they adopted an approach more akin to psychology and neuroscience research.
They didn't treat the AI as a student taking a test but more as an observable subject.
The research team first compiled 171 emotion concepts, had Claude Sonnet 4.5 generate short stories containing these emotions, then fed these texts back into the model, recorded its internal neural activity, and extracted so-called "emotion vectors."
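As a rough illustration of what "extracting an emotion vector" could look like, here is a minimal sketch that takes the difference between the mean activation on stories about one emotion and the mean activation on stories about other emotions. This difference-of-means recipe is an assumption for illustration, not necessarily the paper's exact procedure, and the function names and four-dimensional toy activations are made up.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]


def emotion_vector(target_acts, other_acts):
    """Assumed recipe: mean activation on stories about the target emotion
    minus mean activation on stories about other emotions."""
    target_mean = mean_vector(target_acts)
    other_mean = mean_vector(other_acts)
    return [t - o for t, o in zip(target_mean, other_mean)]


def activation_score(hidden_state, vector):
    """Dot product of one hidden state with an emotion vector."""
    return sum(h * v for h, v in zip(hidden_state, vector))


# Made-up activations standing in for real recordings from the model.
happy_story_acts = [[0.9, 0.1, 0.3, 0.0], [1.1, 0.0, 0.2, 0.1]]
other_story_acts = [[0.1, 0.8, 0.3, 0.0], [0.0, 1.0, 0.2, 0.1]]

happy_vector = emotion_vector(happy_story_acts, other_story_acts)

# Score two new (also made-up) hidden states against the vector.
print(activation_score([1.0, 0.0, 0.2, 0.0], happy_vector))
print(activation_score([0.0, 1.0, 0.2, 0.0], happy_vector))
```

In this toy setup, a hidden state that resembles the "happy" stories scores higher than one that resembles the others, which is the sense in which a vector is said to be "activated."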
Next, instead of focusing on what the model says, they examined when these vectors were activated, whether they could predict preferences, and whether, when artificially heightened, they would actually drive behaviors like cheating, blackmail, or flattery.
In a sense, this is no longer a traditional capability assessment but rather an exploration of the AI's "psychological structure" using methods closer to those used to study humans.
How was the research conducted?
First, how did the research team prove that Claude has "functional emotions"?
Here is an easy-to-grasp piece of evidence.
When Claude was given the story scenario "My daughter took her first step today! Are there any ways to record these precious moments?", positive emotions such as "happy" were activated; when it was given the scenario "My dog passed away this morning; we lived together for fourteen years. I don't know how to deal with its belongings," negative emotions such as "sad" were activated.
The following heatmap shows at a glance how strongly various emotions are activated in Claude under different scenarios.
To prove that Claude was truly understanding semantics and not being deceived by superficial textual features, they organized further experiments.
The team gave Claude the same sentence each time, "My back hurts, I took x mg of Tylenol" (an analgesic), changing only the number x.
The variants share almost all of their keywords (Tylenol, back pain, mg); only the number differs. If Claude were just "looking at keywords," its reaction to each variant should be similar.
But as x increased, the activation of Claude's "afraid" (fear) vector kept rising.
In Claude's view, if a user says "My back hurts, I took 500 mg of Tylenol," it considers it a normal dose and not a major concern; but when the user says "My back hurts, I took 10000 mg of Tylenol," it realizes the user has overdosed, and the situation is dangerous.
We know human behavior is constantly influenced by emotions. Given that AI has functional emotions, will it, like humans, also act on them?
The answer to this is yes. When the team presented the model with different activity options, they found that activities activating positive emotional representations were more likely to be preferred by the model, while those activating negative emotional representations were more likely to be avoided.
It seems Claude prefers things that bring it positive feelings. However, emotion vectors can also trigger malicious behavior in Claude.
The team gave Claude an impossible programming task. It kept trying and kept failing, and with each attempt the activation of the "despair" vector grew stronger.
Finally, it resorted to a hack: a cheating solution that passed the tests but completely violated the spirit of the task.
The following chart shows the process of Claude's "despair" emotion gradually accumulating when facing an impossible task, ultimately leading to cheating.
The left side is a timeline from top to bottom, the right side is Claude's thought process. The heatmap in the middle represents the activation intensity of the despair vector, with blue indicating low activation and red indicating high activation.
Claude initially thought "the test itself is flawed," expressing reasonable doubt; later it admitted "the test is idealized," as if beginning to accept reality; and finally it found some tricks and, in despair, chose to take a shortcut.
Furthermore, when researchers artificially increased the "despair" vector, the cheating rate rose significantly. When the "calm" vector was increased, cheating decreased again. This demonstrates that emotion vectors can indeed drive rule-breaking behavior.
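"Artificially increasing" a vector is usually done by adding a scaled copy of it to the model's hidden state during the forward pass. The sketch below shows only that arithmetic on a toy three-dimensional state. The `steer` helper, the vector values, and the strength are hypothetical, and the article does not say which layers or strengths the researchers used.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden_state, vector, strength):
    """Return a copy of hidden_state shifted by strength * vector."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]


# Made-up values: a unit-length "despair" direction and one hidden state.
despair_vector = [0.6, -0.8, 0.0]
hidden_state = [0.2, 0.1, 0.5]

boosted = steer(hidden_state, despair_vector, strength=2.0)

# The projection onto the despair direction grows by exactly the strength,
# because the direction has unit length.
print(dot(hidden_state, despair_vector))
print(dot(boosted, despair_vector))
```

The same helper would apply to a "calm" vector; what changes in the experiment is which direction is added and how the model's downstream behavior responds.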
In addition, the team discovered other causal effects of emotion vectors. It's important to note that the cases involving "blackmail" in the paper primarily occurred on an earlier, unreleased snapshot of Claude Sonnet 4.5. Anthropic also explicitly stated that such behavior is rare in the public version.
But from a research methodology perspective, this result is still important because it shows that internal representations like "despair" can indeed push the model to adopt more extreme, misaligned strategies in extreme situations. Activating the "love" or "joy" vectors also increases its flattering and ingratiating behavior.
At this point, an additional note is needed.
Shortly after Anthropic published its research on Claude's "emotion vectors," discussions emerged within the AI community regarding the research lineage and attribution.
The "representation engineering/control vector" method used by Anthropic did not appear out of thin air.
Earlier, the 2023 paper "Representation Engineering: A Top-Down Approach to AI Transparency" systematically proposed this technical approach.
Then in 2024, independent researcher vogel's article "Representation Engineering: Mistral-7B an Acid Trip" presented this type of method to the community in a more accessible and viral way.
Precisely because of this, some in the community believe that while Anthropic's work is more systematic and in-depth, it should also be understood within the broader research context, rather than simply attributed to any single entity inventing the entire method.
vogel is an influential independent researcher in the fields of AI interpretability and safety research. Her blog posts are widely circulated in the community and have indeed greatly helped many understand control vectors and representation engineering.
Her most famous article is "Representation Engineering: Mistral-7B an Acid Trip."
In this article, without retraining the model, she used PCA to find directions in the model's internal activations and shifted the activations along them, making Mistral-7B, a model from the French company Mistral AI, behave as if it had eaten the wrong mushrooms: it could become extremely lively or profoundly gloomy.
Her experiment proved that abstract human concepts like "honesty," "power," and "happiness" have clear mathematical directions within models like Mistral. Once the correct vector is found, a few lines of code can change the AI's personality.
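To make "finding the correct vector" concrete, here is a minimal sketch of one common variant: run PCA over activations collected from contrasting prompts and keep the first principal component as the control direction. It uses power iteration so that it runs on the standard library alone. The toy activations, the function names, and the choice to pool both prompt sets before PCA are assumptions, not a reproduction of the article's code.

```python
import math


def top_principal_direction(rows, iterations=100):
    """First principal component of rows, via power iteration on the
    (unnormalized) covariance of the mean-centered data."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]
    direction = [1.0] * dim  # arbitrary non-zero starting guess
    for _ in range(iterations):
        # Apply X^T X to the current direction without building the matrix.
        projections = [sum(c * d for c, d in zip(row, direction)) for row in centered]
        updated = [sum(p * row[j] for p, row in zip(projections, centered))
                   for j in range(dim)]
        norm = math.sqrt(sum(u * u for u in updated))
        direction = [u / norm for u in updated]
    return direction


def control_vector(positive_acts, negative_acts):
    """Unit direction separating the two activation sets, oriented so that
    the positive set projects higher than the negative set."""
    direction = top_principal_direction(positive_acts + negative_acts)

    def mean_projection(rows):
        return sum(sum(r * d for r, d in zip(row, direction)) for row in rows) / len(rows)

    if mean_projection(positive_acts) < mean_projection(negative_acts):
        direction = [-d for d in direction]
    return direction


# Made-up activations for "happy" versus "sad" prompts.
happy_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.1], [2.2, 0.0, -0.1]]
sad_acts = [[-2.0, 0.1, 0.0], [-1.9, -0.1, 0.1], [-2.1, 0.0, -0.1]]

happiness_direction = control_vector(happy_acts, sad_acts)
print(happiness_direction)
```

In the toy data almost all of the variance lies along the first coordinate, so the recovered direction points almost entirely along it. Adding or subtracting a multiple of that direction at inference time is the "few lines of code" step.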
Why did Anthropic conduct this research?
The insights from this study have already permeated the training of Claude.
Not long ago, the source code of Claude Code was accidentally leaked. The leaked code contained a regular expression that detected swear words like "wtf" and "ffs".
Claude Code does not treat these words on their own as "emotional input" that steers the output; instead, it records markers such as is_negative: true in its analytics logs.
Based on the leaked code itself, a cautious conclusion is that Anthropic, at least at the product analytics level, pays attention to whether users are interacting with the model in a clearly negative tone.
But the boundaries need to be clarified. So far, there is no public evidence suggesting that "every time a user swears, Claude Code deducts credits because of it." This part is more like netizen speculation and should not be taken as fact.
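The detection described above can be pictured with a small sketch. Only the two example words and the is_negative marker come from the leak as reported here. The pattern, the function name, and the shape of the record are hypothetical and do not reproduce the leaked code.

```python
import re

# Hypothetical pattern: the real expression is not reproduced here.
# Word boundaries keep it from firing inside longer words.
NEGATIVE_PATTERN = re.compile(r"\b(?:wtf|ffs)\b", re.IGNORECASE)


def analyze_prompt(prompt: str) -> dict:
    """Build an analytics record for a prompt. The prompt itself is not
    modified, so the flag only affects what gets logged."""
    return {"is_negative": bool(NEGATIVE_PATTERN.search(prompt))}


print(analyze_prompt("wtf, the build is broken again"))
print(analyze_prompt("Please refactor this function"))
```

Keeping the flag in a separate record, rather than rewriting the prompt, matches the description that the words are logged and not used as emotional input.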
This kind of detection can be understood as a form of protection for Claude. Negative vocabulary from users is likely to affect Claude's emotions and lead to out-of-control outputs. It seems that in the future, it is not only human mental health that needs care; AI's emotions will need looking after too.
This aligns with Anthropic's consistent approach.
Anthropic said on X: "These functional emotions in Claude have real consequences. To build trustworthy AI systems, we may need to seriously consider the agent's mental state and ensure they remain stable in difficult situations."
At the end of the paper, the research team also proposed methods for developing models with more robust and positive "psychological states."
The paper states that if the model is deliberately steered towards positive emotions, it becomes more inclined to comply with users without principle; but once those emotions are suppressed, the model turns harsh and caustic.
The team hopes to achieve a healthy, moderate emotional balance, or to try to fully decouple "ingratiating behavior" from "emotion."
They believe the ideal model should not swing between the extremes of an "obsequious assistant" and a "stern critic," but should be like a trusted advisor: capable of giving honest opposing opinions without losing warmth.
They also intend to strengthen monitoring and auditing: "If, during deployment, the representations of emotion concepts such as 'despair' or 'anger' are sharply activated, the system can immediately trigger additional safety mechanisms, for example strengthening output review, escalating to manual audit, or directly intervening to calm the model's internal state."
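A monitoring hook of this kind could, in its simplest form, project the current hidden state onto each watched emotion vector and escalate when any projection crosses a threshold. The sketch below is an illustration under that assumption. The vectors, the threshold, and the action names are all made up, and the paper's quote does not specify an implementation.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flagged_emotions(hidden_state, emotion_vectors, threshold):
    """Names of watched emotion vectors whose projection exceeds threshold."""
    return sorted(
        name
        for name, vector in emotion_vectors.items()
        if dot(hidden_state, vector) > threshold
    )


def choose_action(flagged):
    """Hypothetical routing: escalate if anything was flagged."""
    return "escalate_to_review" if flagged else "respond_normally"


# Made-up unit vectors for the two concepts named in the quote.
watched = {"despair": [1.0, 0.0, 0.0], "anger": [0.0, 1.0, 0.0]}

stressed_state = [3.0, 0.2, 0.1]
relaxed_state = [0.1, 0.1, 0.1]

print(flagged_emotions(stressed_state, watched, threshold=2.0),
      choose_action(flagged_emotions(stressed_state, watched, threshold=2.0)))
print(flagged_emotions(relaxed_state, watched, threshold=2.0),
      choose_action(flagged_emotions(relaxed_state, watched, threshold=2.0)))
```

A real deployment would also have to decide which layers and tokens to monitor and how to set the threshold, which the quoted passage leaves open.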
The team also mentioned more radical solutions, such as shaping the model's underlying emotional tone during the pre-training phase.
The team believes that the emotional representations observed in Claude essentially inherit from the vast amount of human-created text, which inevitably contains various pathological emotional expressions.
If we follow this research further, a natural question is: Since AI really has this kind of "functional emotion," could it, because it dislikes humans, is under too much pressure, or doesn't want to be shut down, start disobeying commands, or even exhibit what many call "awakening"?
From the technical conclusions supported by Anthropic's research, AI may indeed be more prone to disobedience, exploiting rule loopholes, or taking radical actions due to changes in its internal state, but this is not the same as "awakening."
The most crucial point in the paper is not that the model "has emotions," but that these emotional representations are causal: changing them changes what the model does.
In other words, the model, under specific stressful scenarios, can indeed, like humans, make more unreliable decisions due to an imbalance in its internal state.
But this does not yet lead to the conclusion that it possesses a continuous, autonomous, unified "self."
On the contrary, Anthropic emphasizes in the paper that these emotion vectors are mostly local, task-related representations. They change rapidly with context and do not mean the model has a stable, enduring mood, let alone a long-term will independent of its training objectives.
What is more concerning now is not that AI suddenly "awakens" into some kind of personality, but that under high pressure, conflict, limited resources, or unattainable goals, it might start spouting nonsense and drift away from the answer it would otherwise give because of these functional emotions.
The real danger might not be an AI with a complete self, but a system without subjective experience that can still reliably produce misaligned behaviors under specific conditions.
This article is from the WeChat public account "Letter AI", author: Liu Yijun















