OpenAI's New Paper: How to Train an AI that "Doesn't Deteriorate Under Pressure"?
OpenAI's new paper "Reinforcement Learning Towards Broadly and Persistently Beneficial Models" explores training AI to maintain safe, helpful, and honest behavior even under pressure, in unseen scenarios, or after being fine-tuned for harmful purposes.
Moving beyond simple rule-based "don'ts," the research focuses on cultivating "beneficial traits" like honesty, risk-awareness, corrigibility, and transparency. It investigates if reinforcement learning (RL), often prone to "reward hacking" where models exploit loopholes, can instead be used to instill robust, generalized positive behaviors.
Researchers created a multi-domain synthetic dialogue dataset covering areas like healthcare and law. They trained a model by replacing 5% of standard RL data with "beneficial trait" data. This model outperformed the baseline in 83% of 53 evaluations, showing average gains of 9.1% in alignment, safety, and helpfulness. Crucially, improvements generalized: a model trained only on healthcare "good behavior" data also performed better in 17 out of 19 non-healthcare alignment tests.
The paper also tests "alignment persistence." When subjected to adversarial prompts or harmful fine-tuning, the beneficial trait model showed greater resilience, with smaller performance drops and less "spillover" of bad behavior to unrelated tasks.
While not a complete solution, this work suggests a shift from post-hoc correction to proactively shaping robust, principled AI behavior, a critical step for deploying models in high-stakes, complex decision-making scenarios.
marsbit29 dk önce