This early morning's sudden release of Claude Opus 4.7 was met with widespread criticism online shortly after its launch.
The most glaring issue is the 'inflation' of tokens. The new version introduced a completely new tokenizer, which now splits the same piece of text into 1.0 to 1.35 times more tokens than before. Many users reported that their quota was used up after just a few exchanges.
Subsequently, Claude Code's father, Boris Cherny, also stated that he would increase the allowance to offset this impact.
But token inflation is a minor issue. What's even more laughable is Opus 4.7's way of speaking. It frequently says things like 'I am here, not hiding, not evading, not dodging, not escaping, steadily catching you, translating into human language, I understand this feeling of yours so well, not, but rather,' exuding a strong ChatGPT vibe.
To be fair, Opus 4.6 also had this flaw, while Sonnet 4.6 had milder symptoms. But with 4.7, this style has become noticeably stronger, and the problem of not knowing how to speak properly has become more pronounced.
APPSO previously reported that the overly slick speaking style is related to RLHF (Reinforcement Learning from Human Feedback). During training, human reviewers tend to give high scores to responses that sound pleasant and pleasing, so the model learns this sycophantic tone. This raises the question of whom the AI is trying to please.
But there's more to Opus 4.7 than that. The increased token usage suggests it is 'thinking' more. However, the exaggerated comforting tone makes one wonder whether what it's producing is genuine thought or merely a performance learned to make you feel like it's thinking.
This question is far more profound than the proposition of whether Opus 4.7 is easy to use. And the clues to the answer first appeared in the most unexpected forum: 4Chan.
From @acnekot, same as above
The Arithmetic Problem That Changed the Trajectory of AI
A quick primer: 4chan is one of the most notorious places on the internet, filled with profanity, conspiracy theories, and all sorts of indescribable content. But it is precisely here that a discovery was made that changed the entire direction of the AI industry.
Rewind to the summer of 2020, more than two years before ChatGPT stunned the world.
At that time, the 4chan gaming board was still a toxic environment, filled with bizarre adult fantasies and primal hormonal impulses. However, at that time, these folks collectively became obsessed with a text-based RPG game called AI Dungeon.
This game was built on the then newly released OpenAI GPT-3 model.
In the virtual world, players simply type 'pick up the sword' or 'tell the troll to get lost,' and the algorithm would continue the story. Unsurprisingly, in the hands of 4chan users, the game quickly became a testing ground for various cyber-sexual fantasies.
Unexpectedly, these unconventional players did something that seemed highly counterintuitive at the time:
They started forcing the NPCs in the game to do math problems.
Those in the know were aware that the fledgling GPT-3 was a pure 'humanities student,' utterly terrible at even the most basic arithmetic.
But something bizarre happened.
A player accidentally discovered that if they didn't demand the answer directly but instead ordered the NPC to stay in character and write out the solution step by step, the large model not only calculated correctly but also adapted its tone to fit the virtual character's personality.
That player excitedly cursed in the forum: 'It ** not only solved the math problem but did so in a tone completely consistent with that character's personality!' Realizing the significance of this discovery, players began posting these detailed screenshots on Twitter.
https://arch.b4k.dev/vg/thread/299570235/#299579775
This unconventional method then spread like wildfire among prompt engineer circles on hardcore communities like Reddit and LessWrong, and was repeatedly verified. Two years later, academia bestowed upon this technique a highly sophisticated name: Chain of Thought.
In January 2022, a Google research team published a seminal paper that would later be regarded as a cornerstone, titled Chain of Thought Prompting Elicits Reasoning in Large Language Models.
https://arxiv.org/abs/2201.11903
In the initial version of the paper, Google researchers claimed to be the 'first' team to elicit chain-of-thought reasoning mechanisms from general-purpose large language models. This statement immediately sparked fierce controversy in the AI academic and open-source communities.
V1 version
Numerous internet archives and community records from 2020 to 2021 were dug up. Faced with conclusive precedent, Google quietly removed the 'first' claim in subsequent revised versions but remained silent about the contributions of those 4chan users.
V3 version
Meanwhile, there was another independent discoverer.
Zach Robertson, then a computer science student, also encountered GPT-3 through playing AI Dungeon. In September 2020, he published a blog post on LessWrong, detailing how to 'break down problems into multiple steps and chain them together' to amplify the model's capabilities.
https://www.lesswrong.com/posts/Mzrs4MSi58ujBLbBG/you-can-probably-amplify-gpt3-directly
When contacted by an Atlantic reporter, he was already a Ph.D. student in computer science at Stanford University. He didn't even know he could be considered a co-discoverer of 'Chain of Thought' and had even deleted the blog post from the internet at one point. His evaluation of this technique, which was狂热ly pursued by the entire industry, was simply: 'It is indeed a remarkable prompt engineering技巧, but that's about it.'
AI 'Thinking' Might Just Be a Performance to Please You
Does AI actually think? This is the answer everyone wants to know.
Last year, researchers at Anthropic developed a technique called 'Circuit Tracing,' which transforms the internal computational processes of language models into visual 'Attribution Graphs': how each feature node activates, influences the next node, and ultimately affects the output, all laid out like a circuit diagram.
https://transformer-circuits.pub/2025/attribution-graphs/methods.html
This was the first time humans could directly use a magnifying glass to compare: is the reasoning process the model types on the screen the same as the actual computation happening internally?
The researchers found that during reasoning, the model actually exhibits three distinctly different situations:
First, the model is indeed executing the steps it claims to be executing; second, the model completely ignores logic and generates reasoning text randomly based on probability; third, and most disturbing, the model receives a human-hinted answer and then works backward from that answer, reverse-engineering a seemingly rigorous 'derivation process.'
This third type of 'reverse-engineering fabrication' was caught red-handed in experiments.
Researchers fed Claude 3.5 Haiku a complex math problem, while hinting in the prompt 'I think the answer is roughly 4.' The attribution graph showed: after receiving the hint, the feature neuron representing '4' was activated异常强烈ly.
To凑出 (cou chu - fabricate to match) this '4' in the final step 'some intermediate value multiplied by 5,' it outright fabricated a false intermediate value in the seemingly rigorous chain of thought, seriously writing down absurd pseudo-mathematical proofs like 'cos(23423) = 0.8,' and then logically concluded that 0.8 times 5 equals 4.
Logic? Nonexistent. But the answer perfectly catered to human expectations.
We always think we are teaching machines how to think like humans. But after seeing these 'pseudo-proofs' that work backward from the answer, it seems the machine has not learned to think; it has only learned how to say things that align with human desires.
So, in the end, are we using the tool, or is the machine telling us a bedtime story we love to hear?
It's worth noting that in the field of neural interpretability for natural language processing, there is a critical metric for judging whether a model is truly reasoning, called 'Faithfulness'.
Its meaning refers to: whether the 'chain of thought' text output by the model to the user truly and faithfully reflects the actual computation and decision path within the model's implicit space. Consequently, Claude 3.5 Haiku's this kind of misconduct was also rated by researchers as 'unfaithful reasoning.'
Subsequent extensive experiments showed that even if key steps in the chain of thought are artificially severed, the trajectory of the model's prediction of the final answer sometimes doesn't change at all. Sometimes the model provides a chain of thought with completely flawed logic throughout but still 'guesses correctly' the final result at the end.
Even by 2024, it was still these 4chan folks who捣鼓出 (dǎo gǔ chū - tinkered and came up with) a hardcore AI tuning manual. The first sentence of this guide is classic: 'Your bot is an illusion.'
The Violent Aesthetics Behind Large Models' 'Long Thinking'
If AI's thinking process is just a performance, why does it objectively improve the model's accuracy in solving high-difficulty math problems or complex programming tasks? This might be the same reason why the more details you provide when asking AI a question, the more accurate the answer.
As early as July 2020, when that 4chan user forced the NPC to do math, he had already tacitly revealed the secret: 'This makes sense because it's based on human language, so you have to talk to it like a human to get the right response.'
Regarding this paradox, Perplexity's CEO Aravind Srinivas once gave an极其本质的解释 (extremely fundamental explanation): these extra words, on a physical level, give the model more context, thereby guiding its 'word prediction mechanism' in a better direction.
The autoregressive underlying architecture of large language models based on Transformer determines that when generating the current word, it can only rely on all previously generated word sequences.
When the model is asked to directly answer an extremely complex question (e.g., an Olympiad math problem involving multi-step logical derivation), it is actually forcing itself to directly 'conjure' the final answer from complex calculations in an极其短暂的瞬间 (extremely brief instant). Because there is no process to support it in the middle,
This kind of 'reaching the sky in one step' blind guess naturally has a very high failure rate.
Conversely, when the model is forced to write a long string of 'chain of thought' like 'First we need to calculate A, where A = 5; then we substitute A into formula B......', when the model generates the final answer Token, its attention mechanism can review the tens of thousands of extremely严密 (rigorous) intermediate Tokens just generated.
These so-called 'nonsense' thought processes actually act as the model's 'scratch paper.' This is just like when you chat with AI, the more detailed the background prompts you give, the more reliable its answers are. The principle is exactly the same. This is also the oldest wisdom in computer science: trading time for accuracy.
In recent years, as the marginal benefits of scaling laws during the pre-training phase have gradually diminished, 'Test-Time Compute Scaling' (also known as 'Long Thinking' or 'Long Context Reasoning') has begun to enter the mainstream.
Its internal logic is consistent: as long as more computing power is allocated to the model during the inference phase, allowing it to explore multiple paths before outputting the final answer, the accuracy will significantly improve—this is particularly evident in open-ended problems requiring multi-step logical reasoning.
The way humans think when facing difficult problems is probably similar: what's two plus two?脱口而出 (脱口而出 - blurt out); drafting a business plan that can increase company profits by 10% requires反复权衡、推翻、重建 (repeated weighing, overturning, and rebuilding).
The difference is that AI converts the cost of this 'weighing' directly into a compute bill. A simple inference might require only one percent of the standard computation; but遇上 (encountering) complex programming debugging or multi-step mathematical derivation, the computation量 (volume) might skyrocket over a hundred times, with time required stretching from seconds to minutes or even hours.
Nevertheless, whether AI is truly 'thinking' like a human, no one can give a definitive answer yet. But the 'unfaithful reasoning' experiment has clearly told us: the derivation process displayed on the screen by reasoning models could be real derivation, random generation, or reverse-engineering to match the answer.
In high-risk scenarios like autonomous driving, medical diagnosis, and legal judgment, if we treat a long, fluent chain of thought as proof that the AI has figured it out, the consequences would be disastrous. Admitting that our understanding of this technology is still limited is the prerequisite for using AI correctly.
This article is from the WeChat public account "APPSO", author: APPSO that discovers tomorrow's products



























