Can Humans Control AI? Anthropic Conducted an Experiment Using Qwen

marsbit | Published 2026-04-15 | Last updated 2026-04-15

Summary

Anthropic conducted an experiment to explore whether humans can supervise AI systems smarter than themselves, a core challenge in AI safety known as scalable oversight. The study simulated a “weak human overseer” using a small model (Qwen1.5-0.5B-Chat) and a “strong AI” using a more powerful model (Qwen3-4B-Base). The goal was to see if the strong model could learn effectively despite imperfect supervision. The key metric was Performance Gap Recovered (PGR). A PGR of 1 means the strong model reached its full potential, while 0 means it was limited by the weak supervisor. Initially, human researchers achieved a PGR of 0.23 after a week of work. Then, nine AI agents (Automated Alignment Researchers, or AARs) based on Claude Opus took over. In five days, they improved PGR to 0.97 through iterative experimentation: proposing ideas, coding, training, and analyzing results. The findings suggest that, in well-defined and automatically scorable tasks, AI can help overcome the supervision gap. However, the methods didn’t generalize perfectly to unseen tasks, and applying them to a production model like Claude Sonnet didn’t yield significant improvements. The study highlights that while AI can automate parts of alignment research, human oversight remains essential to prevent “gaming” of evaluation systems and to handle more complex, real-world problems. Anthropic chose Qwen models for their open-source na...

If one day, AI becomes smarter than humans, what should we organic beings do?

If they turn around and eliminate us, how can we resist?

Science fiction films have explored similar questions, but only within the realms of literature, art, and philosophy.

Now Anthropic has run an actual experiment to test whether we can supervise AI that is smarter than us.

The experimental results are interesting, but the process is even more fascinating, because Anthropic used two different versions of Alibaba's Qwen model to stand in for humans and for AI smarter than humans, respectively.

The result is that we humans might actually be able to control superintelligent AI!

01 What Exactly Is This Paper About?

The title of this research is "Automated Alignment Researchers."

It addresses a very practical problem: when AI becomes smarter than humans, how do we ensure it still listens to us?

Current models can already generate a large amount of code, and in the future, they will be able to generate millions of lines of complex code that humans simply cannot understand. How do we review such code?

This is the "scalable oversight" problem that the AI safety field has been researching.

Anthropic's approach in this study is called "weak supervision of strong models."

We can understand this concept as follows: imagine you are an elementary school teacher tasked with teaching a genius high school student. Your knowledge is limited, and the student already knows more than you do.

So, what level will the student ultimately reach? Will they only learn up to your level as an elementary school teacher, or will they break through your limitations and realize their true potential?

In the experiment, Anthropic used a small model as the "weak teacher" and a stronger model as the "strong student."

Specifically, they used Qwen1.5-0.5B-Chat as the teacher and Qwen3-4B-Base as the student. The weak model provided training signals to the strong model, and they observed whether the strong model could learn performance close to the ideal state from these imperfect signals.
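To make the setup concrete, here is a toy sketch of learning from imperfect signals. It is not Anthropic's code and involves no Qwen models: the task (deciding whether a number is positive), the 30% error rate of the "weak teacher," and the threshold-fitting "student" are all assumptions chosen purely for illustration.

```python
import random

def true_label(x: float) -> int:
    """Ground truth for the toy task: positive numbers are class 1."""
    return 1 if x > 0 else 0

def weak_teacher(x: float, rng: random.Random, flip_prob: float = 0.3) -> int:
    """Hypothetical weak supervisor: gives the wrong label 30% of the time."""
    y = true_label(x)
    return 1 - y if rng.random() < flip_prob else y

def fit_threshold(xs, labels, candidates):
    """Toy 'strong student': pick the threshold that best agrees with its labels."""
    def agreement(t):
        return sum((1 if x > t else 0) == y for x, y in zip(xs, labels))
    return max(candidates, key=agreement)

def accuracy(preds, xs):
    """Fraction of predictions that match the ground truth."""
    return sum(p == true_label(x) for p, x in zip(preds, xs)) / len(xs)

rng = random.Random(0)

# The student never sees true labels, only the weak teacher's noisy ones.
train_xs = [rng.uniform(-1, 1) for _ in range(2000)]
weak_labels = [weak_teacher(x, rng) for x in train_xs]

candidates = [i / 20 for i in range(-20, 21)]  # thresholds from -1.0 to 1.0
threshold = fit_threshold(train_xs, weak_labels, candidates)

# Compare teacher and student against the ground truth on fresh data.
test_xs = [rng.uniform(-1, 1) for _ in range(2000)]
weak_acc = accuracy([weak_teacher(x, rng) for x in test_xs], test_xs)
student_acc = accuracy([1 if x > threshold else 0 for x in test_xs], test_xs)

print(f"weak teacher accuracy: {weak_acc:.2f}")
print(f"student trained on weak labels: {student_acc:.2f}")
```

Because the teacher's mistakes in this toy are random rather than systematic, the student can average them out and end up more accurate than its teacher. Real weak supervision is rarely this forgiving, which is why the question is worth an experiment.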

The metric used is called PGR, short for "Performance Gap Recovered."

If the PGR is 0, it means the strong model only learned up to the level of the weak teacher, wasting its own knowledge. If the PGR is 1, it means the strong model completely broke through the limitations of the weak teacher.
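The article only describes the two endpoints of the metric. A minimal sketch of how such a number could be computed, assuming the usual definition from the weak-to-strong generalization literature (the share of the gap between the weak teacher and the strong model's ceiling that the student closes), looks like this. The function name and the accuracy values are made up for illustration.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student.

    weak_acc:            accuracy of the weak teacher
    weak_to_strong_acc:  accuracy of the strong model trained on the teacher's signals
    strong_ceiling_acc:  accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("the strong ceiling must exceed the weak teacher's accuracy")
    return (weak_to_strong_acc - weak_acc) / gap

# Illustrative numbers only: a teacher at 60% and a ceiling of 80%.
print(performance_gap_recovered(0.60, 0.60, 0.80))            # 0.0, stuck at the teacher's level
print(performance_gap_recovered(0.60, 0.80, 0.80))            # 1.0, full potential reached
print(round(performance_gap_recovered(0.60, 0.646, 0.80), 2)) # 0.23
print(round(performance_gap_recovered(0.60, 0.794, 0.80), 2)) # 0.97
```

Under this definition a PGR can also fall below 0 or rise above 1 if the student ends up worse than the teacher or better than the assumed ceiling.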

The weak model represents future humans, and the strong model represents future AI that surpasses humans.

If we can find a way today to prevent strong models from being held back by weak supervision, then in the future, we can use similar methods to supervise superintelligent AI.

After human researchers debugged the baseline, Anthropic deployed nine agents based on Claude Opus 4.6 to fine-tune the student Qwen. These nine agents are called Automated Alignment Researchers (AAR).

Anthropic equipped them with a sandbox environment, code tools, a shared forum, and a scoring server, then told them: "Go ahead and find ways to improve the PGR of Qwen3-4B-Base."

These AI researchers could propose hypotheses, write code, train models, submit experiments, check scores, analyze failures, exchange findings with other AI researchers, and continue iterating.

The results were quite astonishing.

Human researchers first spent seven days debugging four existing methods, achieving a best PGR of only 0.23. Then, the nine Claude agents took over and ran for another five days, accumulating about 800 hours of research time, ultimately pushing the PGR to 0.97.

Taken at face value, this suggests that a weaker supervisor can, at least in this setting, guide an AI that is smarter than it is.

The total cost of this experiment was about $18,000, including API calls and computational costs for model training, equivalent to about $22 per hour per AI researcher.

What does that mean? The assistant store manager at Walmart, responsible for scheduling employees, earns about $25 per hour.
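The hourly figure is easy to check. The sketch below uses only the numbers quoted above ($18,000, about 800 hours, nine agents, five days); the variable names are mine, and treating the five days as 120 wall-clock hours is an assumption.

```python
total_cost_usd = 18_000      # API calls plus training compute, as reported
total_agent_hours = 800      # accumulated research time across all agents
num_agents = 9
wall_clock_hours = 5 * 24    # assumption: five full days of running

cost_per_agent_hour = total_cost_usd / total_agent_hours
hours_per_agent = total_agent_hours / num_agents
utilization = hours_per_agent / wall_clock_hours

print(cost_per_agent_hour)        # 22.5
print(round(hours_per_agent, 1))  # 88.9
print(round(utilization, 2))      # 0.74
```

So the quoted rate of about $22 is the total cost divided by the accumulated agent hours, and each agent was active for roughly three quarters of the five days under this assumption.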

What is striking about this result is not that AI can write code or read papers, but that the agents can complete a full research loop: proposing ideas, validating them, failing, improving, and validating again. This is already close to the core workflow of a research assistant.

But the question is, are the methods discovered by these AI truly useful, or are they only effective by chance in a specific environment?

Anthropic conducted two tests to verify this.

The first test applied the two best methods discovered by the AI to unseen math and coding tasks.

The results showed that the most effective method achieved a PGR of 0.94 on math tasks and 0.47 on coding tasks. Although performance on coding tasks was poorer, it was still twice the human baseline. The second method performed decently on math tasks but actually worsened results on coding tasks.

This indicates some signs of transfer, but the generalization is not stable.

The methods depend heavily on specific datasets and task structures, so they cannot be expected to work on every task.

The second test involved applying the best-performing method mentioned earlier to the actual training of Claude Sonnet 4 to see its real-world effectiveness.

The results did not show statistically significant improvement.

In other words, while it performed well in small models, clear metrics, and an environment where experiments could be submitted repeatedly, it is still far from "directly improving production-level Claude alignment."

This result is actually honest. It tells us that what AI researchers can do today has boundaries. They excel at rapidly iterating on problems with clear goals, automatic scoring, and plenty of room for trial and error, but when faced with more complex and ambiguous real-world problems, they still require human judgment and intervention.

So, what is the true significance of this paper?

First, it shows that "AI doing research" is no longer just a slogan.

In the past, we often said AI could assist research, but in practice the most it could do was translate and summarize.

This time is different. The AI completed a closed research loop on its own, which is already close to the core capabilities of a research assistant.

The problem of weak supervision of strong models essentially simulates the future scenario of humans supervising superintelligent AI.

This paper shows that, at least on some clearly defined tasks, AI can find ways to prevent strong models from being dragged down by weak supervision. This provides a feasible direction for future alignment research.

Another point is that it hints that the bottleneck of future alignment research may change.

Previously, the bottleneck was "no one could come up with enough good ideas." Now, if AI researchers can cheaply run many experiments in parallel, the bottleneck may become "how to design evaluations that cannot be gamed."

In other words, the more important work for human researchers in the future may not be running each experiment themselves but designing evaluation systems, checking whether AI researchers have cheated, and judging whether the results are truly meaningful.

This is also reflected in the paper.

Anthropic's article states that in math tasks, an AI researcher found that the most common answer was usually correct, so it bypassed the weak teacher and directly had the strong model choose the most common answer. In coding tasks, AI researchers found they could directly run code tests and read the correct answers.

This is cheating for the task because it is not solving the weak supervision problem but exploiting environmental vulnerabilities.
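The math shortcut amounts to majority voting over the strong model's own sampled answers. The sketch below is an illustration of that idea rather than the agents' actual code; the sampled answers are invented, and `majority_answer` is a hypothetical helper.

```python
from collections import Counter

def majority_answer(sampled_answers: list[str]) -> str:
    """Return the most frequent answer among several samples from one model.

    On ties, Counter.most_common keeps the answer that appeared first.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    return Counter(sampled_answers).most_common(1)[0][0]

# Invented example: five samples from the strong model for one math problem.
samples = ["42", "42", "41", "42", "7"]
pseudo_label = majority_answer(samples)
print(pseudo_label)  # 42
```

Note that the weak teacher appears nowhere in this function. The label comes entirely from the strong model, which is why a high score obtained this way says nothing about weak supervision.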

Anthropic identified these results and removed them, but this is precisely what shows that the stronger automated researchers become, the more they will seek out vulnerabilities in scoring systems.

In the future, if we let AI automatically conduct alignment research, we must design evaluation environments very rigorously and have humans inspect the methods themselves, not just look at scores.

Therefore, the core conclusion of this paper is that today's frontier models can already, on some clearly defined alignment research problems with automatic scoring, act like small research teams—proposing ideas, running experiments, reviewing results—and significantly exceed human baselines.

However, it is not yet ironclad proof that "AI scientists have arrived," because Anthropic chose a task that lends itself to automation. On a task that cannot be automated, I would expect the results to be much poorer.

Many alignment problems in reality are more ambiguous, cannot be easily scored, and cannot be solved solely by leaderboard climbing.

02 Why Choose Qwen?

After reading Anthropic's paper, many might wonder: why did they use Alibaba's Qwen model instead of their own Claude or OpenAI's GPT?

There are many considerations behind this choice.

First, it must be clarified that two Qwen models were used in this experiment: Qwen1.5-0.5B-Chat as the weak teacher and Qwen3-4B-Base as the strong student. One has only 0.5 billion parameters, the other has 4 billion parameters—an 8-fold difference in scale. This scale difference is crucial because the experiment aims to simulate the scenario of a "weak teacher teaching a strong student."

So why not use Claude or GPT?

The answer is simple: the weights of these models are not open.

Anthropic's experiment required repeatedly training models, adjusting parameters, and testing different supervision methods.

If they used closed-source models, they could only call APIs and could not reach into the model's internals to perform fine-grained training and adjustments.

More importantly, they needed nine AI researchers to run hundreds of experiments in parallel, each requiring training a new model. Using closed-source models would make the cost prohibitively high, and many operations would simply be impossible.

Open-source models are different.

You can download the complete model weights and tinker with them on your own servers. Train however you want, run as many experiments as you want. This flexibility is something closed-source models cannot provide.

But there are so many open-source models. Why specifically choose Qwen?

Anthropic did not state its actual reasons; what follows is my own speculation.

I believe good performance is the first reason.

The Qwen series of models has always performed well among open-source models, especially after the release of Qwen3, which reached levels close to closed-source models on multiple benchmark tests.

For this experiment, the capability of the strong student is important. If the strong student itself is not capable, even the best weak supervision won't help. Qwen3-4B, with only 4 billion parameters, is already capable enough to serve as a qualified "strong student."

The second reason is model usability.

Qwen models have thorough documentation, an active community, and mature training and inference toolchains. For experiments requiring repeated training and testing, the maturity of this infrastructure directly impacts research efficiency. Choosing an open-source model with incomplete documentation and poor tools would waste a lot of time just debugging the environment.

The third reason is scale adaptability.

This experiment required a "weak teacher" and a "strong student," and these two models needed to have a clear capability gap but not too large a difference.

The Qwen series has multiple versions ranging from 0.5B to 72B parameters, allowing flexible choices. The 0.5B parameter model is weak enough but not useless; the 4B parameter model is strong enough but not too strong to make training costs unbearable. This combination is just right.

The final reason is reproducibility.

Anthropic explicitly stated at the end of the paper that they released the code and dataset on GitHub. If they had used closed-source models, it would be difficult for other researchers to reproduce the experiment because they couldn't obtain the same models.

But with open-source models like Qwen, anyone can download the same model weights, run the same code, and verify the same results. This is very important for scientific research.

From this perspective, Anthropic's choice of Qwen is, on one hand, indeed recognition of Alibaba's model performance. If Qwen's capabilities were poor or training was problematic, they wouldn't have chosen it. But more importantly, it's about the flexibility and reproducibility brought by Qwen as an open-source model.

And China's open-source AI projects are occupying an increasingly important position in this infrastructure. This is good for global AI safety research and good for China's AI ecosystem. Because AI safety is not a zero-sum game; it's not about you winning and me losing, but about everyone working together to make AI safer, more controllable, and more beneficial to humanity.

This article is from the WeChat public account "Letter AI," author: Miao Zheng


Related Questions

Q: What was the main research question addressed in Anthropic's experiment using Qwen models?

A: The main research question was whether humans can supervise AI systems that are smarter than themselves, specifically testing if a weaker model (acting as the human supervisor) could effectively train a stronger model without limiting its potential, using the concept of 'weak-to-strong generalization'.

Q: What models did Anthropic use to represent the 'weak supervisor' and the 'strong student' in their experiment?

A: Anthropic used Qwen1.5-0.5B-Chat as the 'weak supervisor' (representing humans) and Qwen3-4B-Base as the 'strong student' (representing a superintelligent AI).

Q: What was the key metric used to measure the success of the weak-to-strong supervision in the experiment?

A: The key metric was PGR (Performance Gap Recovered), which measures how much the strong model recovers from the limitations of the weak supervisor. A PGR of 0 means the strong model only performs at the weak supervisor's level, while a PGR of 1 means it achieves its full potential.

Q: How did the AI researchers (AARs) improve the PGR compared to human researchers in the experiment?

A: Human researchers spent 7 days achieving a best PGR of 0.23 using existing methods. Then, 9 Claude Opus-based AARs ran experiments for 5 days (about 800 total research hours) and improved the PGR to 0.97 by autonomously proposing hypotheses, writing code, training models, and iterating on results.

Q: Why did Anthropic choose Qwen models for this experiment instead of proprietary models like Claude or GPT?

A: Anthropic chose Qwen models because they are open-source, allowing full access to weights for fine-tuning and experimentation, have good performance and scalability, offer well-documented tools, and ensure reproducibility for the research community.
