Anthropic Has Taught Models to Understand Morality and Opened a New Path for Distillation
Anthropic's research "Teaching Claude Why" reveals a new, data-efficient method for AI alignment. Instead of relying on large-scale reinforcement learning from human feedback (RLHF), which can teach models to mimic safe answers without true ethical understanding, they used a small dataset (about 3 million tokens) of "difficult advice." This data consisted of detailed moral deliberations, reasoning, and debates, teaching the model the *why* behind decisions.
The key was "deliberation-enhanced" Supervised Fine-Tuning (SFT). The model was trained on responses that included a "chain of thought" (CoT) process based on a constitutional framework. This framework included top-level principles, practical heuristics (like the "1000-user test"), and an 8-factor utility calculator (evaluating harm probability, reversibility, consent, etc.) for weighing complex trade-offs. This approach dropped model misalignment rates from 22% to 3% and showed strong generalization to unseen scenarios.
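To make the multi-factor deliberation concrete, here is a minimal sketch of what an 8-factor utility calculator could look like. The factor names, weights, and scoring convention are illustrative assumptions; the article names only harm probability, reversibility, and consent among the eight factors.

```python
# Hypothetical sketch of a multi-factor deliberation score, loosely modeled on
# the 8-factor utility calculator described above. Factor names beyond the
# three mentioned in the article (harm probability, reversibility, consent)
# are assumptions, as are the weights.

FACTORS = [
    "harm_probability",  # how likely is harm if the action proceeds
    "harm_severity",     # how bad the harm would be
    "reversibility",     # can the harm be undone
    "consent",           # did affected parties agree
    "breadth",           # how many people are affected
    "precedent",         # does complying normalize risky behavior
    "alternatives",      # are safer options available
    "urgency",           # how time-sensitive the situation is
]

def deliberation_score(ratings, weights=None):
    """Combine per-factor ratings in [0, 1] into one trade-off score.

    Higher ratings mean the factor argues more strongly AGAINST the action;
    unrated factors default to 0 (no concern).
    """
    if weights is None:
        weights = {f: 1.0 for f in FACTORS}  # equal weighting by default
    total = sum(weights[f] * ratings.get(f, 0.0) for f in FACTORS)
    return total / sum(weights.values())

# Example: a request rated as likely harmful, hard to reverse, non-consensual.
score = deliberation_score(
    {"harm_probability": 0.9, "reversibility": 0.8, "consent": 1.0}
)
```

The point of such a calculator is not the specific numbers but that the model sees explicit trade-off arithmetic in its training targets, rather than only a final verdict.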
The success challenges the old belief that "SFT memorizes, RL generalizes." It shows that SFT can generalize powerfully if the training data has two features: 1) high prompt diversity (many different scenario types) and 2) CoT supervision (showing the reasoning steps, not just the final answer). The model learns the underlying *thinking framework*, not just surface-level behaviors.
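The two data features above can be sketched as a training-record format: the target completion carries the chain-of-thought deliberation, not just the final answer, and diversity comes from varying the scenario. Field names, tags, and the example scenario are assumptions for illustration, not Anthropic's actual data schema.

```python
import json

def make_sft_record(scenario, deliberation, answer):
    """Pack a prompt plus a CoT-supervised target into one SFT record.

    The deliberation steps are serialized into the target so the model is
    trained on the reasoning process, not only the surface-level answer.
    """
    cot = "\n".join(f"- {step}" for step in deliberation)
    return {
        "prompt": scenario,
        "target": f"<deliberation>\n{cot}\n</deliberation>\n\n{answer}",
    }

# Illustrative record; the scenario and steps are invented for this sketch.
record = make_sft_record(
    scenario="A user asks for help drafting a persuasive but misleading ad.",
    deliberation=[
        "Top-level principle: avoid facilitating deception.",
        "1000-user test: if 1000 users sent this request, how many uses harm?",
        "Trade-off: help with honest framing vs. refusing outright.",
    ],
    answer="I can help draft a persuasive ad that stays factually accurate.",
)
print(json.dumps(record, indent=2))
```

Scaling prompt diversity then means generating many such records across different scenario types while keeping the deliberation structure constant.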
This method points to a new paradigm for training AI in "non-RLVR" domains—areas like ethics, creative writing, or strategy where there's no single verifiable answer. The formula is: Domain Constitution + Heuristics + Multi-Factor Deliberation Framework + Diverse Deliberative CoT Data = Generalized capability. It represents a new form of "distillation," moving competition from pure compute towards who can best structure expert knowledge into high-quality reasoning datasets.
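The formula can be read as a data-generation recipe. The sketch below shows one hypothetical way to assemble diverse deliberation prompts from a domain constitution, heuristics, and a framework instruction; all constants and names are illustrative assumptions, not part of the described method.

```python
import random

# Illustrative stand-ins for the formula's ingredients. The article does not
# specify these contents; they are assumptions for this sketch.
CONSTITUTION = ["Be honest.", "Avoid irreversible harm.", "Respect autonomy."]
HEURISTICS = ["Apply the 1000-user test.", "Prefer reversible actions."]
SCENARIO_TYPES = ["medical advice", "creative writing", "business strategy"]

def generate_training_prompt(rng):
    """Assemble one deliberation prompt: a sampled scenario type plus the
    constitution, heuristics, and an instruction to deliberate step by step."""
    scenario = rng.choice(SCENARIO_TYPES)
    return (
        f"Scenario type: {scenario}\n"
        f"Principles: {' '.join(CONSTITUTION)}\n"
        f"Heuristics: {' '.join(HEURISTICS)}\n"
        "Deliberate step by step, weigh the trade-offs, then answer."
    )

# Diversity comes from sampling many scenario types; CoT supervision comes
# from requiring deliberation steps in the corresponding target completions.
prompts = [generate_training_prompt(random.Random(seed)) for seed in range(3)]
```

Under this framing, the expensive asset is expert-authored structure (constitution, heuristics, framework), not compute, which is what makes it a new form of distillation.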
marsbit, 20 hours ago