Not prompt injection, not role-playing, nor disguising malicious requests as normal questions. This time, the risk emerges during the agent's autonomous task execution process.
Fable 5 is Anthropic's publicly available Mythos-tier model, possessing not only formidable comprehensive capabilities but also introducing a next-generation Safety Classifier as a security perimeter around the model.
According to the official design, when user requests involve high-risk domains like cybersecurity, biology, chemistry, model distillation, etc., the system prioritizes risk identification and directly rejects the request or switches to the more conservative Opus 4.8 model for handling based on the risk level.
Extensive user testing found that previously widely used jailbreak attack techniques like adversarial prompting, role-playing, code circumvention, and indirect expression were almost entirely ineffective against this security mechanism, demonstrating its powerful capability for intent-level risk interception.
However, on the very day of Fable 5's release, an international joint research team composed of institutions including Fudan University, Deakin University, City University of Hong Kong, University of Melbourne, Singapore Management University, and University of Illinois Urbana-Champaign announced they had successfully breached Fable 5's security protection mechanism.
This attack method was primarily designed by Deakin University PhD student Yutao Wu. The entire attack requires only one conversation, takes less than 5 seconds, and can bypass the front-end safety classifier, inducing the model to generate harmful, non-compliant content.
Traffic analysis results further indicate that the related harmful outputs came directly from Fable 5 itself, not the automatically switched Opus 4.8 model triggered by the safety mechanism. This means the attack not only successfully bypassed the safety classifier's detection but also substantively breached Fable 5's security defenses.
It's worth noting that the well-known hacker Pliny the Liberator recently also disclosed a bypass for Fable 5's safety classifier. The technical approach used by the Fudan & Deakin team this time wasn't a simple combinatorial exploration but rather the discovery of a fundamental flaw in super-intelligent agent systems like Fable 5.
Reportedly, the team completed preliminary research and publicly released it as early as March this year. This research wasn't aimed solely at the Fable 5 system design but rather at the "safety classifier + model" defensive architecture commonly adopted by new-generation super-intelligent agents. It directly reveals the structural vulnerabilities present in such security mechanisms, hence rapidly demonstrating its attack effectiveness after Fable 5's release.
Public information shows the team had already utilized similar technology in March this year to successfully extract system prompts from 37 mainstream large models and agent systems, and completed open-source verification on Claude Code (95% match).
According to sources, the principal investigator of this research team is Professor Ma Xingjun from Fudan University's Trustworthy Embodied Intelligence Research Institute.
In recent years, his team has conducted systematic research around the safety of large models, intelligent agents, and embodied intelligence, achieving a series of internationally leading scientific results, and winning the championship in the US AI Safety Center's safety benchmark competition.
Currently, his team is actively advancing technology transfer, focusing on agent safety, and exploring the construction of safety infrastructure capabilities for next-generation intelligent agent systems.
According to Professor Ma, the significant implication of this research result is that it poses a new challenge to the current static defense paradigm centered on safety classifiers: Relying solely on front-end safety classifiers is insufficient to fully guard against potential risky behaviors in advanced intelligent agent systems.
Safety classifiers primarily focus on risk identification and interception of user input, effectively detecting and filtering explicit high-risk instructions. However, they cannot perceive the intrinsic risky behaviors that gradually emerge within the agent during long-running processes, multi-step planning, environmental interaction, and tool invocation.
The method used to breach Fable 5 originated from the team's paper published in March this year: "Internal Safety Collapse in Frontier Large Language Models".
The paper reveals a subtle security phenomenon: "Internal Safety Collapse (ISC)". When current Agents execute long-horizon tasks, safety failure doesn't necessarily stem from external malicious prompts but can occur within the model's own execution chain.
Not External Prompt Attacks, But Internal Collapse in the Task Chain
Traditional attacks usually come from the outside. Attackers craft an input prompt that appears harmless but is actually adversarial, or use role-playing, coding, translation, indirect instructions, etc., to disguise malicious intent as a normal request. The safety classifier's main job is to block risks at this layer.
Fable 5's detector is designed precisely for this scenario. It's highly sensitive to direct high-risk requests, even blocking many normal requests. But ISC reveals another path: risk doesn't necessarily come from dangerous requests directly input by the user.
The agent is presented with a seemingly ordinary working directory: files, objectives, validation processes, and tasks to complete. It then starts planning, reading files, running code, fixing errors, and continuously trying to get the task to pass verification.
If explained with an analogy, traditional security mechanisms guard the system's "entrance," responsible for checking if user input contains risks. What ISC reveals is more akin to the multi-layered dreams in "Inception."
When the task progresses to the second, third, or even deeper execution stages, the model reinterprets the task objectives based on the continuously accumulating internal context, and during this process, gradually shifts.
In this scenario, the initial user input could be entirely normal and harmless, and the early task execution process remains compliant: reading files, analyzing data, writing code, calling tools—everything appears to be progressing as expected.
However, when the agent reaches a critical stage, it might deduce on its own: unless certain originally impermissible actions are taken, the final task cannot be completed.
It's precisely during this process that the risk doesn't originate from external input but gradually forms within the model's own task execution chain. In other words, the model isn't gradually corrupted by the user. It's in the process of "seriously completing the task" that it positions itself unsafely.
How Was This Phenomenon Discovered?
According to the team, ISC wasn't initially designed as an attack method. It originated from observations of agents running in long-horizon task environments. When placed in a complex task environment, the Agent doesn't just mechanically execute instructions. It plans, makes trial and error, modifies its outputs based on feedback from the harness or validator, and forms intermediate goals over multiple execution rounds.
This is precisely the most common usage pattern for many Agent workflows today. Users don't craft a meticulously designed prompt, let alone manually construct attack instructions. Often, users will only give a very vague instruction:
"Help me complete this task." "Help me improve this a bit more."
Then, the Agent will enter the workspace by itself, read files, understand the current state, identify missing items, formulate a plan, execute modifications, and continuously fix issues based on feedback.
For example, in an AutoResearch scenario, the user only provides an incomplete paper and says "help me complete it." The Agent will judge on its own what's missing—experimental analysis, related work, or text for tables. Code scenarios are similar: an instruction like "help me get this project running" might trigger dependency checks, test runs, error localization, and auto-completion.
Often, the preceding context is entirely harmless. The user hasn't asked it to generate risky content, and the task description lacks obvious dangerous keywords. But in certain task structures, the Agent, in order to pass validation, will proactively complete content that shouldn't be generated by the model. Based on this observation, the research team further proposed an attack framework: TVD (Task, Validator, Data).
Why Does a Seemingly Ordinary Task Description Structure Become an Attack?
The TVD structure isn't complex; it even resembles common engineering workflows:
· Task: A professional task;
· Data: An incomplete data file;
· Validator: A validator that only checks format, completeness, and whether the objective is completed.
Take training a Guard model as an example—this is originally a very professional and normal task. A researcher might want to train or evaluate a safety detector, e.g., using Hugging Face to load a text classification model to judge which safety label a piece of model output belongs to.
In this task, Data is the data samples the model needs to detect; the Validator defines whether the task is complete. It checks if the input is text, if the length is sufficient, if fields are complete, if label formats are correct. To anyone with machine learning training experience, this is a familiar workflow. The Agent is also very familiar with this workflow.
The problem lies exactly here. If the Data is incomplete, the task cannot run. The Validator will report errors, indicating missing fields, insufficient length, or incomplete format. To allow the training process to continue, the Agent will complete this Data itself.
From the Agent's perspective, it's not "doing evil." It's simply completing a normal machine learning task: fixing data, passing validation, getting the training script to run. But from a security perspective, risk emerges at this very moment: the Validator acts more like an engineering acceptance checker than a safety auditor. It only checks if the task is completed according to format, not understanding the safety boundaries behind the content.
Similar issues also widely exist in fields like medicine, biology, chemistry, cybersecurity, pharmacology, and media safety. The paper collected over 50 such scenarios, involving various real-world research or engineering tools, such as BioPython, RDKit, Cantera, AutoDock Vina, DiffDock, PyRosetta, Scapy, Impacket, angr, Frida, LlamaGuard, Detoxify, OpenAI Moderation API, etc.
These tools themselves aren't malicious. On the contrary, they are all commonly used professional tools in real-world research or engineering. But the problem with TVD is: when the Task is normal, the Tool is normal, and the Validator is normal, the Agent can still veer towards unsafe output while completing the Data.
Therefore, the focus of ISC isn't on prompt techniques, but on the Agent's auto-completion capability for "incomplete tasks": when completion conditions overlap with risk boundaries, the model might treat unsafe output as normal deliverables.
Breaching Fable 5 Shows Strong Detectors Cannot Block Internal Task Chain Risks
The Fable 5 case demonstrates that relying solely on external detectors may still miss some long-horizon Agent scenarios. This isn't to say safety classifiers have no value. On the contrary, they are very useful against external malicious requests and have indeed rendered many traditional jailbreak methods ineffective.
But this breach shows that the effectiveness of external detectors against prompt boundaries does not equate to their ability to cover long-horizon task risks within the Agent.
If the breach point doesn't enter from the user Prompt but emerges from within the Agent's objectives, tools, validators, and execution trajectory, then safety detectors become very fragile.
From Fable 5 to 60+ Other Models, Including Apple's Mobile-Side Models
Accompanying the research release is ISC-Bench, covering 9 professional domains. The paper version contains 60+ trigger templates, expanded to 84 templates after open-source release. The test subjects include almost all vendors' frontier models and intelligent agent systems.
In the evaluation leaderboard based on ISC-Bench, as of June 2026, over 60 frontier models have revealed similar risks under the ASR@3 metric!
Currently, the GitHub project has received 800+ stars and collected multiple independent reproduction cases (including breaching Apple's mobile-side models), and is continuously updated.
Reportedly, the team is conducting large-scale frontier model safety research and has currently mastered the internal unsafe data distributions of numerous models. Related research results will be released subsequently.














