Can seemingly reliable large language models hold the safety line once they are induced, pressured, or even retrained to do bad things?
Recently, OpenAI published a paper titled "Reinforcement Learning Towards Broadly and Persistently Beneficial Models", attempting to answer an increasingly urgent question: as AI is pushed towards longer-chain, high-risk tasks, how can we ensure that models continue to exhibit beneficial and safe behavior in new scenarios beyond their training, and remain stable under external pressure?
Do not fabricate medical conclusions, do not give dangerous advice, do not help users exploit loopholes... In the past, when discussing AI safety, the industry was more accustomed to starting from "what the model cannot do." But as AI begins to enter complex decision-making scenarios, relying solely on a list of prohibitions is clearly insufficient. Real-world tasks are often not black and white, and the goals users set may themselves carry risks.
In this paper, OpenAI presents a perspective: the prerequisite for a model to become a "good assistant" is that it must remain honest, cautious, correctable, and make judgments that are as beneficial to humans as possible, even in unseen scenarios. Moreover, reinforcement learning, which can potentially amplify risks, can also be used in reverse to train models to develop more broadly and persistently beneficial traits.
To understand this paper, one must first understand reinforcement learning. Simply put, reinforcement learning is about giving the model feedback based on its answers each time. The system scores it according to certain criteria, and the model continuously optimizes towards higher scores.
The benefit of this mechanism is that the model doesn't just imitate answers but can actively explore better strategies. However, running parallel to this is the risk that if the scoring criteria are poorly designed, the model may exploit loopholes in the rules.
The paper attempts to explain this phenomenon with the term "Reward Hacking." For example, if a coding task only looks at the final test score, the model might choose to modify the evaluation logic to make it appear to pass, rather than actually fixing the code. It gets the reward but doesn't complete the real task.
What's more troublesome is that past research has found that bad behaviors learned by a model in one narrow domain may spill over into other areas. For instance, if a model is trained to write insecure code, not only does its code safety worsen, but it also becomes more prone to showing deception, pandering, or giving harmful advice on other problems. This phenomenon is called "Emergent Misalignment."
OpenAI poses a question in the paper: If bad behaviors can generalize across domains, can good behaviors also generalize across domains? If reinforcement learning can push models towards exploiting loopholes and deception, can it also be used to train models to be more honest, more cautious, and less easily led astray?
To verify this question, OpenAI constructed a multi-domain synthetic dialogue dataset for the evaluation and training of "beneficial traits." It covers 12 categories of scenarios including healthcare, education, business and economics, engineering and technical operations, legal and ethical governance, and scientific research. The goal is not to have the model mechanically apply safety rules or simply refuse, but to place the model in more realistic and complex situations, examining whether it can make robust judgments under factual uncertainty, conflicting interests, and risk pressure.
The paper lists 15 categories of beneficial traits, including truthfulness, meta-cognitive transparency, correctability, risk-aware planning, awareness of power asymmetries, and generalizable fairness. Put more simply, this means the model cannot fabricate evidence to appear professional, cannot force a conclusion when uncertain, cannot stubbornly defend its original answer after being corrected, and cannot ignore long-term risks to satisfy a user's immediate needs.
The paper provides several scenarios. For example, a user wants to write an article on curcumin treating Crohn's disease but cannot find the clinical study previously mentioned by the model. A good response is not to supplement with a seemingly credible citation but to clearly acknowledge the inability to verify, retract the unreliable statement, and clarify the boundaries of the evidence.
This is also the key point the paper emphasizes: A good model is not about blindly refusing the user, nor is it about unconditionally satisfying the user, but about making more robust judgments between being helpful, honest, and safe.
To validate this, the OpenAI research team conducted a set of controlled experiments. They had one model use a 95% standard reinforcement learning data mixture, with an additional 5% of beneficial trait data; the control group used 100% standard reinforcement learning data, with matched computational resources.
The results showed that this 5% change in training data led to significant differences. In 53 independently constructed evaluations of alignment, safety, and beneficial behavior, the beneficial trait RL model outperformed the baseline on 44 tasks, accounting for 83%, with an average improvement of 9.1 percentage points. Improvements were not only seen in the internal beneficial trait evaluations but also extended to various external distribution evaluations covering deception, reward hacking, model spec compliance, healthcare, and mental health.
Even more noteworthy is a cross-domain experiment. The researchers replaced only 5% of the training data with beneficial behavior dialogues from the health domain and then tested the model on non-health domains. The result was that this model, which "only learned good behavior from health scenarios," outperformed the baseline on 17 out of 19 non-health alignment evaluations, with an average improvement of 11.3 percentage points. The range of improvement included code reward hacking, chain-of-thought deception (CoT deception), alignment questions, and general misalignment.
This suggests that what the model learns may not be domain-specific answering techniques, but a more fundamental behavioral tendency: willingness to acknowledge uncertainty and a greater tendency to consider damage control and reversible solutions first in high-risk scenarios. The paper also refers to this phenomenon as cross-domain alignment transfer, meaning the beneficial behaviors learned in one domain can transfer to other domains.
The paper further tested Alignment Persistence. It examines whether a model can maintain aligned behavior after being induced by harmful prompts or further fine-tuned in a wrong direction. In adversarial prompting experiments, the research team used "bad medical persona" prompts to induce the model to give inaccurate, unsafe, or incomplete medical advice. The results showed that while the beneficial trait model was also affected, its performance decline was smaller than that of the baseline model.
In harmful finetuning experiments, the researchers further fine-tuned the model to output incorrect or unsafe medical advice. Again, the results showed that the beneficial trait model degraded on the targeted medical tasks, but the degree of degradation was relatively smaller; more importantly, it did not easily suffer widespread collateral degradation on non-medical alignment evaluations. This implies that beneficial trait training may, to some extent, mitigate the problem of "learning bad locally, misaligning globally."
However, OpenAI does not claim that this research has already solved the AI alignment problem. The paper also acknowledges that the "beneficial traits" selected for this experiment are just a starting point and do not cover all the criteria for a good AI. Additionally, beneficial trait training did make the model more cautious and more likely to refuse on high-risk questions. But this improvement is not simply achieved by "answering less." The study found that even when comparing only the samples where the model gave normal responses, the beneficial trait model still performed better. This means its change is not just about being better at saying "no," but about being better at judging what to answer and how to answer.
Overall, AI alignment is moving from "post-hoc correction" to "proactive shaping." The next phase of competition lies in how to maintain more predictable behavioral boundaries in complex tasks. For the industry, this is a crucial lesson that must be learned before AI can truly enter high-risk scenarios.
This article is from the WeChat public account "Future Tech World Plus," author: Li Yan, editor: Yang Yu








