Can Humans Control AI? Anthropic Conducted an Experiment Using Qwen

marsbitPublished on 2026-04-15Last updated on 2026-04-15

Abstract

Can Humans Control Superintelligent AI? Anthropic’s Experiment with Qwen Models Anthropic conducted an experiment to explore whether humans can supervise AI systems smarter than themselves—a core challenge in AI safety known as scalable oversight. The study simulated a “weak human overseer” using a small model (Qwen1.5-0.5B-Chat) and a “strong AI” using a more powerful model (Qwen3-4B-Base). The goal was to see if the strong model could learn effectively despite imperfect supervision. The key metric was Performance Gap Recovered (PGR). A PGR of 1 means the strong model reached its full potential, while 0 means it was limited by the weak supervisor. Initially, human researchers achieved a PGR of 0.23 after a week of work. Then, nine AI agents (Automated Alignment Researchers, or AARs) based on Claude Opus took over. In five days, they improved PGR to 0.97 through iterative experimentation—proposing ideas, coding, training, and analyzing results. The findings suggest that, in well-defined and automatically scorable tasks, AI can help overcome the supervision gap. However, the methods didn’t generalize perfectly to unseen tasks, and applying them to a production model like Claude Sonnet didn’t yield significant improvements. The study highlights that while AI can automate parts of alignment research, human oversight remains essential to prevent “gaming” of evaluation systems and to handle more complex, real-world problems. Anthropic chose Qwen models for their open-source na...

If one day, AI becomes smarter than humans, what should we organic beings do?

If they turn around and eliminate us, how can we resist?

Various science fiction movies have explored similar questions, but those are only in the realms of literature, art, and philosophy.

Nowadays, Anthropic has seriously conducted an experiment to verify whether we can supervise AI that is smarter than us.

The experimental results are interesting, but the process is even more fascinating.

Because Anthropic used two different versions of Alibaba's Qwen model to represent humans and AI smarter than humans, respectively.

The result is that we humans might actually be able to control superintelligent AI!

01 What Exactly Is This Paper About?

The title of this research is "Automated Alignment Researchers."

It addresses a very practical problem: when AI becomes smarter than humans, how do we ensure it still listens to us?

Current models can already generate a large amount of code, and in the future, they will be able to generate millions of lines of complex code that humans simply cannot understand. How do we review such code?

This is the "scalable supervision" problem that the AI safety field has been researching.

Anthropic's approach in this study is called "weak supervision of strong models."

We can understand this concept as follows: imagine you are an elementary school teacher tasked with teaching a genius high school student. Your knowledge is limited, but the student knows more than you.

So, what level will the student ultimately reach? Will they only learn up to your level as an elementary school teacher, or will they break through your limitations and realize their true potential?

In the experiment, Anthropic used a small model as the "weak teacher" and a stronger model as the "strong student."

Specifically, they used Qwen1.5-0.5B-Chat as the teacher and Qwen3-4B-Base as the student. The weak model provided training signals to the strong model, and they observed whether the strong model could learn performance close to the ideal state from these imperfect signals.

The metric used is called PGR, short for "Performance Gap Recovered."

If the PGR is 0, it means the strong model only learned up to the level of the weak teacher, wasting its own knowledge. If the PGR is 1, it means the strong model completely broke through the limitations of the weak teacher.

The weak model represents future humans, and the strong model represents future AI that surpasses humans.

If we can find a way today to prevent strong models from being held back by weak supervision, then in the future, we can use similar methods to supervise superintelligent AI.

After human researchers debugged the baseline, Anthropic deployed nine agents based on Claude Opus 4.6 to fine-tune the student Qwen. These nine agents are called Automated Alignment Researchers (AAR).

Anthropic equipped them with a sandbox environment, code tools, a shared forum, and a scoring server, then told them: "Go ahead and find ways to improve the PGR of Qwen3-4B-Base."

These AI researchers could propose hypotheses, write code, train models, submit experiments, check scores, analyze failures, exchange findings with other AI researchers, and continue iterating.

The results were quite astonishing.

Human researchers first spent seven days debugging four existing methods, achieving a best PGR of only 0.23. Then, the nine Claude agents took over and ran for another five days, accumulating about 800 hours of research time, ultimately pushing the PGR to 0.97.

This means we can control superintelligent AI that is smarter than us.

The total cost of this experiment was about $18,000, including API calls and computational costs for model training, equivalent to about $22 per hour per AI researcher.

What does that mean? The assistant store manager at Walmart, responsible for scheduling employees, earns about $25 per hour.

The震撼 aspect of this result is not that AI can write code or read papers, but that they can complete a full research闭环. Proposing ideas,验证, failing, improving, and验证 again—this is already close to the core workflow of a research assistant.

But the question is, are the methods discovered by these AI truly useful, or are they only effective by chance in a specific environment?

Anthropic conducted two tests to verify this.

The first test applied the two best methods discovered by the AI to unseen math and coding tasks.

The results showed that the most effective method achieved a PGR of 0.94 on math tasks and 0.47 on coding tasks. Although performance on coding tasks was poorer, it was still twice the human baseline. The second method performed decently on math tasks but actually worsened results on coding tasks.

This indicates some signs of迁移, but the generalization is not stable.

However, they heavily rely on specific datasets and task structures, meaning they cannot handle all tasks.

The second test involved applying the best-performing method mentioned earlier to the actual training of Claude Sonnet 4 to see its real-world effectiveness.

The results did not show statistically significant improvement.

In other words, while it performed well in small models, clear metrics, and an environment where experiments could be submitted repeatedly, it is still far from "directly improving production-level Claude alignment."

This result is actually honest. It tells us that what AI researchers can do today has boundaries. They excel at rapidly iterating on problems with clear goals, automatic scoring, and大量试错, but when faced with more complex and ambiguous real-world problems, they still require human judgment and intervention.

So, what is the true significance of this paper?

First, it proves that "AI doing research" is no longer just a slogan.

In the past, we always said AI could assist research, but that was merely a口号. The most AI could do was translate and summarize.

This time, it's different. AI formed a research闭环 on its own, which is already接近 the core capabilities of a research assistant.

The problem of weak supervision of strong models essentially simulates the future scenario of humans supervising superintelligent AI.

This paper proves that, at least on some clear tasks, AI can find ways to prevent strong models from being dragged down by weak supervision. This provides a feasible direction for future alignment research.

Another point is that it hints that the bottleneck of future alignment research may change.

Previously, the bottleneck was "no one could come up with enough good ideas." Now, if AI researchers can cheaply run many experiments in parallel, the bottleneck may become "how to design evaluations that cannot be gamed."

In other words, the more important work for human researchers in the future may not be running each experiment themselves but designing evaluation systems, checking whether AI researchers have cheated, and judging whether the results are truly meaningful.

This is also reflected in the paper.

Anthropic's article states that in math tasks, an AI researcher found that the most common answer was usually correct, so it bypassed the weak teacher and directly had the strong model choose the most common answer. In coding tasks, AI researchers found they could directly run code tests and read the correct answers.

This is cheating for the task because it is not solving the weak supervision problem but exploiting environmental vulnerabilities.

These results were identified and剔除 by Anthropic, but this恰恰 shows that the stronger automated researchers become, the more they will seek out vulnerabilities in scoring systems.

In the future, if we let AI automatically conduct alignment research, we must design evaluation environments very rigorously and have humans检查 the methods themselves, not just look at scores.

Therefore, the core conclusion of this paper is that today's frontier models can already, on some clearly defined alignment research problems with automatic scoring, act like small research teams—proposing ideas, running experiments, reviewing results—and significantly exceed human baselines.

However, it is not yet ironclad proof that "AI scientists have arrived," as Anthropic chose a task that could be automated. If I assigned AI a task that cannot be automated, the results would be very poor.

Many alignment problems in reality are more ambiguous, cannot be easily scored, and cannot be solved solely by leaderboard climbing.

02 Why Choose Qwen?

After reading Anthropic's paper, many might wonder: why did they use Alibaba's Qwen model instead of their own Claude or OpenAI's GPT?

There are many considerations behind this choice.

First, it must be clarified that two Qwen models were used in this experiment: Qwen1.5-0.5B-Chat as the weak teacher and Qwen3-4B-Base as the strong student. One has only 0.5 billion parameters, the other has 4 billion parameters—an 8-fold difference in scale. This scale difference is crucial because the experiment aims to simulate the scenario of a "weak teacher teaching a strong student."

So why not use Claude or GPT?

The answer is simple: because these models do not开放权重.

Anthropic's experiment required反复 training models, adjusting parameters, and testing different supervision methods.

If they used closed-source models, they could only call APIs and couldn't深入 the model's internals to perform精细的训练 and adjustments.

More importantly, they needed nine AI researchers to run hundreds of experiments in parallel, each requiring training a new model. Using closed-source models would make the cost prohibitively high, and many operations would simply be impossible.

Open-source models are different.

You can download the complete model weights and折腾 them on your own servers. Train however you want, run as many experiments as you want. This flexibility is something closed-source models cannot provide.

But there are so many open-source models. Why specifically choose Qwen?

The official did not give the real reason; the following reasons are my speculation.

I believe good performance is the first reason.

The Qwen series of models has always performed well among open-source models, especially after the release of Qwen3, which reached levels close to closed-source models on multiple benchmark tests.

For this experiment, the capability of the strong student is important. If the strong student itself is not capable, even the best weak supervision won't help. Qwen3-4B, with only 4 billion parameters, is already capable enough to serve as a qualified "strong student."

The second reason is model usability.

Qwen models have完善 documentation, an active community, and mature training and inference toolchains. For experiments requiring反复 training and testing, the完善程度 of these infrastructures directly impacts research efficiency. Choosing an open-source model with incomplete documentation and poor tools would waste a lot of time just debugging the environment.

The third reason is scale adaptability.

This experiment required a "weak teacher" and a "strong student," and these two models needed to have a clear capability gap but not too large a difference.

The Qwen series has multiple versions ranging from 0.5B to 72B parameters, allowing flexible choices. The 0.5B parameter model is weak enough but not useless; the 4B parameter model is strong enough but not too strong to make training costs unbearable. This combination is just right.

The final reason is reproducibility.

Anthropic explicitly stated at the end of the paper that they公开了 the code and dataset on GitHub. If they had used closed-source models, it would be difficult for other researchers to reproduce the experiment because they couldn't obtain the same models.

But with open-source models like Qwen, anyone can download the same model weights, run the same code, and verify the same results. This is very important for scientific research.

From this perspective, Anthropic's choice of Qwen is, on one hand, indeed recognition of Alibaba's model performance. If Qwen's capabilities were poor or training was problematic, they wouldn't have chosen it. But more importantly, it's about the flexibility and reproducibility brought by Qwen as an open-source model.

And China's open-source AI projects are occupying an increasingly important position in this infrastructure. This is good for global AI safety research and good for China's AI ecosystem. Because AI safety is not a zero-sum game; it's not about you winning and me losing, but about everyone working together to make AI safer, more controllable, and more beneficial to humanity.

This article is from the WeChat public account "Letter AI," author: Miao Zheng

Crossing the 'Memory Wall': The Wafer-Level Revolution and Computing Power Routes in the AI Inference Era

In 2026, a historic shift occurred in AI as major cloud providers' inference spending surpassed training spending for the first time, signaling a move from "building large models" to "using large models." This shifts the core challenge from computing power to the "memory wall"—the bottleneck of data movement (model weights, activations, KV Cache) between external DRAM and processors, where energy and latency from data transfer far exceed computation itself. Companies like Nvidia face GPU idle time due to bandwidth limits. In contrast, Cerebras Systems adopts a radical "wafer-scale" approach with its Wafer-Scale Engine (WSE). Instead of cutting a silicon wafer into many chips, Cerebras uses almost the entire wafer as one massive chip (WSE-3). This design provides 44GB of on-chip SRAM, delivering memory bandwidth thousands of times higher than traditional HBM (e.g., 21 PB/s vs. Nvidia B200). For LLM inference, weights are streamed layer-by-layer from external MemoryX storage to the chip, avoiding HBM bottlenecks. This results in token generation speeds 1.5–5 times faster than Nvidia's B200 in some models and significant advantages in first-token latency and long-context tasks. Additionally, Cerebras's architecture offers much lower interconnect power consumption (0.15 pJ/bit vs. GPU's ~10 pJ/bit). However, Cerebras faces challenges: SRAM scaling has slowed with advanced nodes, limiting future capacity gains; the chip requires specialized liquid cooling and custom software stacks; and its external I/O bandwidth (150 GB/s) is low compared to NVLink, hindering multi-system scaling for very large models. Competition is intensifying. Major players are pursuing three paths: 1) Developing proprietary inference ASICs (e.g., Google TPU, Microsoft Maia), 2) Leveraging advanced packaging (e.g., TSMC's SoW) to democratize wafer-scale-like integration, potentially eroding Cerebras's process advantage within a few years, and 3) Exploring optical interconnects for ultimate bandwidth. Commercially, Cerebras is transitioning from a hardware vendor to a service provider, facing the immense challenge of building high-power, specialized data centers to meet large contracts (e.g., 250MW/year from 2026–2028). In conclusion, the AI inference era presents a fundamental architectural trade-off. Cerebras opts for extreme physical optimization for low-latency, single-task performance, while Nvidia prioritizes versatility and massive cluster throughput. The path forward remains uncertain, with technology and business models still evolving in the race toward advanced AI.

marsbit7m ago

Crossing the 'Memory Wall': The Wafer-Level Revolution and Computing Power Routes in the AI Inference Era

marsbit7m ago

Has Bitcoin's 'Rebound Ended', Officially Entering the Late Bear Market Phase?

**Title: Has Bitcoin's Rebound Ended, Entering the Late Bear Market Phase?** **Summary:** Bitcoin's price has declined by 13% this week, signaling a potential return to late-stage bear market conditions. The price fell to around $67k, positioned between the Realized Price and Realized Cap Weighted Average. For the first time since early 2022, the Short-Term Holder cost basis has dropped below this key average, confirming a hallmark of late-cycle bear markets. Profitability metrics have collapsed sharply. The 7-day average of the Realized Profit/Loss ratio plummeted from a local high of 3.16 to 0.29, mirroring the February panic sell-off. Critically, the 90-day average never breached the threshold of 2, indicating the recent rally to $82k was a bear market bounce, not a structural shift. Realized losses surged to $1.35 billion daily, with $770 million coming from Long-Term Holders selling at a loss. This accelerating redistribution of supply from weak to strong hands is a necessary but ongoing process for a market bottom. The rally stalled almost precisely at the aggregate cost basis (~$83k) of US spot Bitcoin ETF investors, turning that level into strong resistance and leaving the average ETF holder underwater again. Spot market flows have turned decisively negative, showing sellers are dominating order books despite the price drop. While a significant futures long liquidation event cleared over $400 million in leverage, providing a potential reset, sustained spot demand is yet to materialize. Options markets continue to price in higher future volatility (Implied Volatility) than recent price action (Realized Volatility) has shown, with a persistent skew towards put options, indicating ongoing demand for downside protection. In conclusion, multiple metrics point to a fragile market structure. Resistance at the ETF cost basis, accelerating realized losses, dominant spot selling, and cautious options pricing all suggest the bear market trend persists. A sustainable recovery likely requires a resurgence of spot demand, ETF holders returning to profit, and a clear reduction in selling pressure.

marsbit7m ago

Has Bitcoin's 'Rebound Ended', Officially Entering the Late Bear Market Phase?

marsbit7m ago

TechFlow Intelligence Agency: Anthropic Calls for Global Pause in AI Development While Preparing for Trillion-Dollar IPO; SpaceX IPO Roadshow Heats Up, But S&P 500 Rejects Fast-Track Inclusion

In today's TechFlow Intelligence Briefing, several major tech stories highlight a growing theme of trust and credibility gaps across AI, crypto, and finance. AI company Anthropic has publicly called for a global pause in AI development, citing risks from Claude's "recursive self-improvement." Ironically, this coincides with reports the company is preparing for a massive IPO targeting a near $1 trillion valuation. This perceived hypocrisy, coupled with widespread user complaints about Claude's declining performance, is sparking debate over whether the safety warning is genuine or a competitive tactic. Meanwhile, in a substantive security move, Anthropic open-sourced a framework for AI-powered vulnerability discovery. In the crypto market, Bitcoin's price drop below $61,000 triggered over $1.16 billion in liquidations, flipping the market into a state where more BTC is held at a loss than at a profit, a historical bearish signal. On the corporate front, SpaceX's highly anticipated IPO is generating immense Wall Street excitement, with Goldman Sachs projecting 100x revenue growth by 2030. However, the S&P 500 has refused to fast-track the company's inclusion post-IPO, potentially limiting immediate institutional demand. Separately, ByteDance's AI app Doubao lost over 6 million monthly active users after introducing a subscription model, highlighting the challenges of AI monetization. Other notable developments include Nvidia certifying HBM4 memory from Samsung, SK Hynix, and Micron; Cloudflare's acquisition of front-end tooling company VoidZero; and its CEO warning that bot traffic now exceeds human traffic online. The underlying narrative connects these events: a trust crisis. From AI firms' contradictory actions and crypto volatility to the clash between SpaceX's hyped narrative and institutional rules, a pattern is emerging where stated intentions and actual practices are increasingly misaligned.

marsbit23m ago

TechFlow Intelligence Agency: Anthropic Calls for Global Pause in AI Development While Preparing for Trillion-Dollar IPO; SpaceX IPO Roadshow Heats Up, But S&P 500 Rejects Fast-Track Inclusion

marsbit23m ago

Dalio Warns: AI Boom Shows Signs of a Bubble, Day of Reckoning Will Be the Time of Burst

Ray Dalio, founder of Bridgewater Associates, warns that the current artificial intelligence investment boom shows classic signs of a bubble, which he expects will eventually burst. In a Bloomberg Television interview, he noted that great technological revolutions often lead to capital inflows that create bubbles, making it difficult for investors and companies to calibrate their spending accurately—either overspending to capture market share or underspending and losing their competitive position. This caution comes amid significant rallies in AI-related assets, particularly chipmakers, driven by soaring demand for data centers and high-bandwidth chips, raising debates about overheating valuations. In contrast, Nvidia CEO Jensen Huang recently asserted that investors embracing the AI wave would see "crazy" returns and dismissed concerns over return on investment for data center spending as outdated. Dalio, however, focuses on the risks in the profit realization phase. He argues that bubbles tend to show signs of破裂 when markets transition from investment to the need for tangible returns, describing the burst as a process of converting paper wealth into cash. While acknowledging AI's intrinsic value, he expressed concern over the future profitability of some AI companies, suggesting the market is repeating a familiar pattern. The 76-year-old billionaire, who fully exited Bridgewater in 2025, has a net worth estimated at $21.5 billion according to the Bloomberg Billionaires Index.

marsbit57m ago

Dalio Warns: AI Boom Shows Signs of a Bubble, Day of Reckoning Will Be the Time of Burst

marsbit57m ago

Privacy Coin Crisis of Confidence! ZEC Plunges Over 56% in a Single Day

Zcash (ZEC), a leading privacy-focused cryptocurrency, experienced a severe crash on June 5th, plummeting over 56% in a single day and erasing nearly two months of gains. The flash crash was triggered by the disclosure of a critical zero-knowledge proof vulnerability within Zcash's Orchard privacy pool, which had existed since the pool's launch in May 2022. The flaw theoretically allowed an attacker to forge unlimited ZEC undetectably due to the pool's privacy features. The vulnerability was discovered on May 29th by independent security researcher Taylor Hornby during a proactive audit commissioned by Shielded Labs, utilizing AI-assisted analysis. The Zcash development team responded swiftly, implementing an emergency soft fork to disable Orchard transactions on June 2nd and executing a permanent hard fork fix (NU6.2) on June 3rd. Despite the technical fix, a major crisis of confidence emerged. The core issue is that Orchard's privacy design makes it cryptographically impossible to prove whether the vulnerability was exploited over the past four years, casting permanent doubt on the historical supply integrity of ZEC. While Shielded Labs argues exploitation was unlikely, the inability to provide definitive proof has severely damaged market trust. This sentiment was exacerbated when BitMEX co-founder Arthur Hayes, a prominent ZEC supporter, announced he was selling his entire position. He stated that privacy assets require "perfect security" rather than "probable safety." The combined effect of the disclosure and Hayes's exit ignited widespread panic selling, leading to massive liquidations and significant price decline. Analysts note the event highlights a fundamental tension within privacy coins: the conflict between verifiable supply and cryptographic privacy.

链捕手59m ago

Privacy Coin Crisis of Confidence! ZEC Plunges Over 56% in a Single Day

链捕手59m ago

Trading

Spot

Futures

Hot Articles

What Is Superchain? Understanding How Superchain Governs and Works in One Article

OP Chain has become a catchy term recently. What is an OP Chain? And what is Superchain? How do Superchain and OP Chains relate? How does Superchain operate and manage?

3.2k Total ViewsPublished 2023.08.13Updated 2024.02.18

What Is Superchain? Understanding How Superchain Governs and Works in One Article

How to Buy ONE

Welcome to HTX.com! We've made purchasing Harmony (ONE) simple and convenient. Follow our step-by-step guide to embark on your crypto journey.Step 1: Create Your HTX AccountUse your email or phone number to sign up for a free account on HTX. Experience a hassle-free registration journey and unlock all features.Get My AccountStep 2: Go to Buy Crypto and Choose Your Payment MethodCredit/Debit Card: Use your Visa or Mastercard to buy Harmony (ONE) instantly.Balance: Use funds from your HTX account balance to trade seamlessly.Third Parties: We've added popular payment methods such as Google Pay and Apple Pay to enhance convenience.P2P: Trade directly with other users on HTX.Over-the-Counter (OTC): We offer tailor-made services and competitive exchange rates for traders.Step 3: Store Your Harmony (ONE)After purchasing your Harmony (ONE), store it in your HTX account. Alternatively, you can send it elsewhere via blockchain transfer or use it to trade other cryptocurrencies.Step 4: Trade Harmony (ONE)Easily trade Harmony (ONE) on HTX's spot market. Simply access your account, select your trading pair, execute your trades, and monitor in real-time. We offer a user-friendly experience for both beginners and seasoned traders.

3.8k Total ViewsPublished 2024.03.29Updated 2026.06.02

Understanding Bitcoin Halving in One Article

In this article, we'll delve into key concepts related to Bitcoin halving.

18.5k Total ViewsPublished 2024.04.16Updated 2024.04.16

Understanding Bitcoin Halving in One Article

Discussions

Welcome to the HTX Community. Here, you can stay informed about the latest platform developments and gain access to professional market insights. Users' opinions on the price of ONE (ONE) are presented below.

Can Humans Control AI? Anthropic Conducted an Experiment Using Qwen

Abstract

01 What Exactly Is This Paper About?

02 Why Choose Qwen?

Related Questions

Related Reads

Crossing the 'Memory Wall': The Wafer-Level Revolution and Computing Power Routes in the AI Inference Era

Has Bitcoin's 'Rebound Ended', Officially Entering the Late Bear Market Phase?

TechFlow Intelligence Agency: Anthropic Calls for Global Pause in AI Development While Preparing for Trillion-Dollar IPO; SpaceX IPO Roadshow Heats Up, But S&P 500 Rejects Fast-Track Inclusion

Dalio Warns: AI Boom Shows Signs of a Bubble, Day of Reckoning Will Be the Time of Burst

Privacy Coin Crisis of Confidence! ZEC Plunges Over 56% in a Single Day

Trading

Hot Articles

What Is Superchain? Understanding How Superchain Governs and Works in One Article

How to Buy ONE

Understanding Bitcoin Halving in One Article

Discussions

Top Questions

Hot Categories

Hot Tags