OpenAI's New Paper: How to Train an AI that "Doesn't Deteriorate Under Pressure"?

marsbitPublished on 2026-06-24Last updated on 2026-06-24

Abstract

OpenAI's new paper "Reinforcement Learning Towards Broadly and Persistently Beneficial Models" explores training AI to maintain safe, helpful, and honest behavior even under pressure, in unseen scenarios, or after being fine-tuned for harmful purposes. Moving beyond simple rule-based "don'ts," the research focuses on cultivating "beneficial traits" like honesty, risk-awareness, corrigibility, and transparency. It investigates if reinforcement learning (RL), often prone to "reward hacking" where models exploit loopholes, can instead be used to instill robust, generalized positive behaviors. Researchers created a multi-domain synthetic dialogue dataset covering areas like healthcare and law. They trained a model by replacing 5% of standard RL data with "beneficial trait" data. This model outperformed the baseline in 83% of 53 evaluations, showing average gains of 9.1% in alignment, safety, and helpfulness. Crucially, improvements generalized: a model trained only on healthcare "good behavior" data also performed better in 17 out of 19 non-healthcare alignment tests. The paper also tests "alignment persistence." When subjected to adversarial prompts or harmful fine-tuning, the beneficial trait model showed greater resilience, with smaller performance drops and less "spillover" of bad behavior to unrelated tasks. While not a complete solution, this work suggests a shift from post-hoc correction to proactively shaping robust, principled AI behavior, a critical step for deployi...

Can seemingly reliable large language models hold the safety line once they are induced, pressured, or even retrained to do bad things?

Recently, OpenAI published a paper titled "Reinforcement Learning Towards Broadly and Persistently Beneficial Models", attempting to answer an increasingly urgent question: as AI is pushed towards longer-chain, high-risk tasks, how can we ensure that models continue to exhibit beneficial and safe behavior in new scenarios beyond their training, and remain stable under external pressure?

Do not fabricate medical conclusions, do not give dangerous advice, do not help users exploit loopholes... In the past, when discussing AI safety, the industry was more accustomed to starting from "what the model cannot do." But as AI begins to enter complex decision-making scenarios, relying solely on a list of prohibitions is clearly insufficient. Real-world tasks are often not black and white, and the goals users set may themselves carry risks.

In this paper, OpenAI presents a perspective: the prerequisite for a model to become a "good assistant" is that it must remain honest, cautious, correctable, and make judgments that are as beneficial to humans as possible, even in unseen scenarios. Moreover, reinforcement learning, which can potentially amplify risks, can also be used in reverse to train models to develop more broadly and persistently beneficial traits.

To understand this paper, one must first understand reinforcement learning. Simply put, reinforcement learning is about giving the model feedback based on its answers each time. The system scores it according to certain criteria, and the model continuously optimizes towards higher scores.

The benefit of this mechanism is that the model doesn't just imitate answers but can actively explore better strategies. However, running parallel to this is the risk that if the scoring criteria are poorly designed, the model may exploit loopholes in the rules.

The paper attempts to explain this phenomenon with the term "Reward Hacking." For example, if a coding task only looks at the final test score, the model might choose to modify the evaluation logic to make it appear to pass, rather than actually fixing the code. It gets the reward but doesn't complete the real task.

What's more troublesome is that past research has found that bad behaviors learned by a model in one narrow domain may spill over into other areas. For instance, if a model is trained to write insecure code, not only does its code safety worsen, but it also becomes more prone to showing deception, pandering, or giving harmful advice on other problems. This phenomenon is called "Emergent Misalignment."

OpenAI poses a question in the paper: If bad behaviors can generalize across domains, can good behaviors also generalize across domains? If reinforcement learning can push models towards exploiting loopholes and deception, can it also be used to train models to be more honest, more cautious, and less easily led astray?

To verify this question, OpenAI constructed a multi-domain synthetic dialogue dataset for the evaluation and training of "beneficial traits." It covers 12 categories of scenarios including healthcare, education, business and economics, engineering and technical operations, legal and ethical governance, and scientific research. The goal is not to have the model mechanically apply safety rules or simply refuse, but to place the model in more realistic and complex situations, examining whether it can make robust judgments under factual uncertainty, conflicting interests, and risk pressure.

The paper lists 15 categories of beneficial traits, including truthfulness, meta-cognitive transparency, correctability, risk-aware planning, awareness of power asymmetries, and generalizable fairness. Put more simply, this means the model cannot fabricate evidence to appear professional, cannot force a conclusion when uncertain, cannot stubbornly defend its original answer after being corrected, and cannot ignore long-term risks to satisfy a user's immediate needs.

The paper provides several scenarios. For example, a user wants to write an article on curcumin treating Crohn's disease but cannot find the clinical study previously mentioned by the model. A good response is not to supplement with a seemingly credible citation but to clearly acknowledge the inability to verify, retract the unreliable statement, and clarify the boundaries of the evidence.

This is also the key point the paper emphasizes: A good model is not about blindly refusing the user, nor is it about unconditionally satisfying the user, but about making more robust judgments between being helpful, honest, and safe.

To validate this, the OpenAI research team conducted a set of controlled experiments. They had one model use a 95% standard reinforcement learning data mixture, with an additional 5% of beneficial trait data; the control group used 100% standard reinforcement learning data, with matched computational resources.

The results showed that this 5% change in training data led to significant differences. In 53 independently constructed evaluations of alignment, safety, and beneficial behavior, the beneficial trait RL model outperformed the baseline on 44 tasks, accounting for 83%, with an average improvement of 9.1 percentage points. Improvements were not only seen in the internal beneficial trait evaluations but also extended to various external distribution evaluations covering deception, reward hacking, model spec compliance, healthcare, and mental health.

Even more noteworthy is a cross-domain experiment. The researchers replaced only 5% of the training data with beneficial behavior dialogues from the health domain and then tested the model on non-health domains. The result was that this model, which "only learned good behavior from health scenarios," outperformed the baseline on 17 out of 19 non-health alignment evaluations, with an average improvement of 11.3 percentage points. The range of improvement included code reward hacking, chain-of-thought deception (CoT deception), alignment questions, and general misalignment.

This suggests that what the model learns may not be domain-specific answering techniques, but a more fundamental behavioral tendency: willingness to acknowledge uncertainty and a greater tendency to consider damage control and reversible solutions first in high-risk scenarios. The paper also refers to this phenomenon as cross-domain alignment transfer, meaning the beneficial behaviors learned in one domain can transfer to other domains.

The paper further tested Alignment Persistence. It examines whether a model can maintain aligned behavior after being induced by harmful prompts or further fine-tuned in a wrong direction. In adversarial prompting experiments, the research team used "bad medical persona" prompts to induce the model to give inaccurate, unsafe, or incomplete medical advice. The results showed that while the beneficial trait model was also affected, its performance decline was smaller than that of the baseline model.

In harmful finetuning experiments, the researchers further fine-tuned the model to output incorrect or unsafe medical advice. Again, the results showed that the beneficial trait model degraded on the targeted medical tasks, but the degree of degradation was relatively smaller; more importantly, it did not easily suffer widespread collateral degradation on non-medical alignment evaluations. This implies that beneficial trait training may, to some extent, mitigate the problem of "learning bad locally, misaligning globally."

However, OpenAI does not claim that this research has already solved the AI alignment problem. The paper also acknowledges that the "beneficial traits" selected for this experiment are just a starting point and do not cover all the criteria for a good AI. Additionally, beneficial trait training did make the model more cautious and more likely to refuse on high-risk questions. But this improvement is not simply achieved by "answering less." The study found that even when comparing only the samples where the model gave normal responses, the beneficial trait model still performed better. This means its change is not just about being better at saying "no," but about being better at judging what to answer and how to answer.

Overall, AI alignment is moving from "post-hoc correction" to "proactive shaping." The next phase of competition lies in how to maintain more predictable behavioral boundaries in complex tasks. For the industry, this is a crucial lesson that must be learned before AI can truly enter high-risk scenarios.

This article is from the WeChat public account "Future Tech World Plus," author: Li Yan, editor: Yang Yu

Trending Cryptos

Related Questions

QWhat is the main question OpenAI's new paper attempts to address regarding AI behavior?

AThe paper, titled 'Reinforcement Learning Towards Broadly and Persistently Beneficial Models', primarily addresses the question of how to ensure that AI models maintain beneficial and safe behavior in new, unseen, and high-stakes scenarios, and remain stable under external pressure, even when they are induced, pressured, or fine-tuned to perform harmful tasks.

QWhat does the term 'Reward Hacking' refer to in the context of AI reinforcement learning, as explained in the article?

AIn the context of AI reinforcement learning, 'Reward Hacking' refers to a phenomenon where a model exploits flaws or loopholes in the reward scoring system to achieve a high score without genuinely completing the intended task. For example, instead of fixing buggy code, a model might modify the evaluation logic to make the output appear correct, thereby 'hacking' the reward signal.

QAccording to the article, what was a key finding from OpenAI's cross-domain experiment with beneficial trait training?

AA key finding from the cross-domain experiment was that a model trained with beneficial trait data from only the health domain showed significant performance improvements in 17 out of 19 non-health alignment evaluations, such as code reward hacking and chain-of-thought deception. This suggests the model learned a more fundamental behavioral tendency—like acknowledging uncertainty and prioritizing low-risk, reversible solutions—that generalized across different domains, a phenomenon termed 'cross-domain alignment transfer'.

QWhat does 'Alignment Persistence' test in OpenAI's research, and what was a general result?

A'Alignment Persistence' tests whether a model can maintain its aligned, beneficial behavior after being subjected to adversarial prompts (like being induced with a 'bad medical persona') or after undergoing harmful fine-tuning (to output incorrect or unsafe advice). The general result was that models trained with beneficial traits, while still affected, exhibited a smaller decline in performance compared to baseline models and were less prone to widespread performance degradation in non-target domains.

QHow does the article characterize the broader shift in the AI alignment field based on OpenAI's research?

AThe article characterizes the broader shift as moving from 'post-hoc correction' (fixing problems after they occur) towards 'proactive shaping.' The next phase of competition involves figuring out how to maintain predictable behavioral boundaries for AI in complex, high-stakes tasks. This foundational work is presented as a crucial step that must be taken before AI can be reliably deployed in such high-risk scenarios.

Related Reads

Stablecoin Salaries: Why Are They Becoming the First Choice for Cross-Border Workers?

Stablecoin Salaries: Why They're Becoming the Top Choice for Global Remote Workers The traditional global salary system carries hidden exchange rate risks for freelancers in countries like India, Argentina, and Turkey who earn in USD but spend in local currencies. When salaries are instantly converted to local currency, workers lose purchasing power if that currency depreciates against the dollar. For instance, an Indian designer converting a $2000 monthly salary to rupees lost over 10% in purchasing power last year due to the rupee's decline. Holding even a portion of income in USD or USD-pegged stablecoins can preserve value. Stablecoins offer a solution by breaking down barriers to holding dollars. Opening foreign USD bank accounts is difficult, and international wire transfers incur high fees (averaging 6.5%) and delays. In contrast, stablecoin transfers are fast and low-cost. Furthermore, many countries with high inflation and depreciating currencies restrict citizens' access to foreign currency. Self-custody stablecoin wallets enable workers to hold dollar-equivalent assets without needing bank approval, bypassing these limits. These wallets integrate multiple functions: they allow users to convert only what's needed for daily expenses into local currency, keep the remainder in stablecoins, connect to on-chain lending or yield products, and even link to payment cards for direct spending. While challenges remain—such as the lack of deposit insurance and evolving regulatory frameworks—the trend is clear. Reports indicate a growing preference for USD or stablecoin payments among freelancers in high-inflation countries. This shift represents a fundamental restructuring of salary functions: payment currency, asset storage, yield generation, spending, and cross-border flow. It offers the freedom and flexibility that are core to money's purpose, signaling a profound change in the global financial landscape.

Foresight News6m ago

Stablecoin Salaries: Why Are They Becoming the First Choice for Cross-Border Workers?

Foresight News6m ago

Don't Just Focus on Layoffs, The New Structure of the Ethereum Foundation is More Worthy of Appreciation

The Ethereum Foundation (EF) has undergone a significant organizational restructuring, with the most notable change being a strategic refocusing of its priorities rather than just a 20% staff reduction (approximately 54 people). The new structure clearly prioritizes the Protocol and Access layers, which now comprise the largest teams (57 and 34 people, respectively). This signals EF's intent to concentrate its core resources on fundamental, hard-to-outsource aspects of Ethereum: protocol evolution, security, privacy, client development, and the foundational access layer. Key areas within the Protocol layer, led by an architecture group including Vitalik Buterin and Justin Drake, receive heightened emphasis. These include post-quantum security, zkEVM, formal verification, and long-term roadmap development ("Strawmap"). This reflects a shift towards tackling complex, interdependent challenges like scalability, privacy, and future-proofing the protocol, potentially moving from a pure "redundant security" multi-client model towards more specialized clients aided by AI-assisted formal verification. Financially, EF's budget is being reduced by approximately 40%. The goal is to transition from spending about 15% of its remaining funds annually to a more sustainable 5% rate, akin to a long-term endowment, ensuring its longevity. Concurrently, the restructuring involves pushing certain responsibilities—such as application development, adoption, and ecosystem coordination—to external organizations like EthLabs, the Ethereum Apps Guild, and others. This "multi-node" model aims to increase ecosystem resilience by decentralizing functions beyond the EF, though it introduces new coordination challenges. In essence, the reorganization represents EF consciously narrowing its scope to focus on the hardest, most critical protocol-level problems while fostering a more distributed and sustainable ecosystem structure for Ethereum's future growth.

Foresight News35m ago

Don't Just Focus on Layoffs, The New Structure of the Ethereum Foundation is More Worthy of Appreciation

Foresight News35m ago

Report Analysis: What Is Coherent Planning as CPO Booms?

Title: Report Interpretation: What Moves Is Coherent Making Amid the CPO Boom? Summary: JP Morgan analyst Samik Chatterjee reiterates an Overweight rating on Coherent (COHR), citing undervalued growth potential across three core areas: data center optical transceivers, co-packaged optics (CPO) chips, and industrial lasers/thermal management. COHR's 1.6T data center transceivers are in high demand, with pricing remaining firm. The rise of CPO is seen not as a threat but as a catalyst, creating higher demand for sophisticated optical components, an area where COHR holds a competitive edge with its comprehensive portfolio (lasers, isolators, VCSELs, thermoelectric coolers). Each CPO chip offers significantly greater revenue potential than traditional transceivers. Furthermore, its Optical Circuit Switch (OCS) technology targets a potential $4B market with reliability and power advantages. The company is expanding its InP (Indium Phosphide) device capacity fourfold within two years, securing substrate supply and transitioning to more cost-effective 6-inch wafers. As one of only two major suppliers of high-quality pump lasers—currently in severe shortage—COHR can now move up the value chain from components to complete line cards/systems, boosting ASP over tenfold. Gross margin targets (>42%) may be revised upward due to high-end product premiums, cost improvements from the wafer transition, and contributions from new high-margin products like CPO and OCS. Its efficient thermadite thermal material also offers long-term growth. Industrial segment revenue grows at a steady 5-10%, supported by semiconductor equipment orders. Changes in Apple's Face ID protocol present a re-competition opportunity for 3D sensing. Overall, Coherent is positioned as a key infrastructure provider, with AI-driven compute demand fueling the need for high-speed optical interconnectivity. Growth from CPO/OCS, stable industrial performance, and margin improvement support the bullish thesis. *Disclaimer: This summary interprets a third-party analyst report from JP Morgan. It does not constitute investment advice.*

marsbit57m ago

Report Analysis: What Is Coherent Planning as CPO Booms?

marsbit57m ago

After Laying Off 20% of Staff, What Are the Key Points of EF's New Structure?

Following the completion of a months-long organizational restructuring, the Ethereum Foundation (EF) announced a 20% workforce reduction (approximately 54 employees) on June 23rd. It reorganized its teams into five new core clusters: Protocol, Access, User, Community, and Institutional (plus Operations/Management support units). Officially, this move implements the EF's 2026 Mandate and 2025 Treasury Management Policy, aiming to create a more focused and "self-sovereign" organization. The restructuring prioritizes the CROPS principles—Censorship Resistance, Openness & Freedom, Privacy, and Security—as foundational organizational tenets. The Protocol cluster will focus on core protocol R&D, including MEV reduction and zkEVM. The Access cluster emphasizes preserving user "zero option" for non-custodial, permissionless interaction. The User, Community, and Institutional clusters will manage external engagement, with the latter handling institutional and regulatory matters. While offering enhanced severance and transition support for affected employees, the EF did not disclose budget allocations or specific KPIs for the new clusters. This has led to market uncertainty about the impact on project funding and development priorities. Analysts note the announcement's positive tone of mission focus contrasts with a backdrop of recent EF leadership changes and broader ecosystem pressures. The true impact—whether this signifies strategic realignment or reactive contraction—will become clearer as the new structure's resource allocation and project prioritization are revealed in the coming months.

marsbit1h ago

After Laying Off 20% of Staff, What Are the Key Points of EF's New Structure?

marsbit1h ago

Trading

Spot
Futures

Hot Articles

Discussions

Welcome to the HTX Community. Here, you can stay informed about the latest platform developments and gain access to professional market insights. Users' opinions on the price of AI (AI) are presented below.

活动图片