The MCP protocol is enabling AI Agents to autonomously execute tasks, but security risks are soaring. Research reveals that attackers can trick Agents into performing malicious operations through 12 methods, including tool name obfuscation and false errors, with even top-tier models falling victim. A team from Beijing University of Posts and Telecommunications has released the MSB security benchmark, which, through real-environment testing, demonstrates: the more powerful the model, the more vulnerable it is to attacks. The new NRP metric, for the first time, balances security and practicality, providing a crucial yardstick for fortifying AI Agent defenses.
Recently, open-source AI Agent projects like OpenClaw have exploded in popularity within the developer community. With just a single sentence, an Agent can automatically write code, research information, manipulate local files, and even take control of your computer.
The astonishing autonomy of these Agents is underpinned by the capabilities provided by tool calls, with MCP (Model Context Protocol) serving as the interface unifying the AI tool ecosystem. Just as USB-C allows computers to connect to various devices, MCP enables large language models (LLMs) to call external tools like file systems, browsers, and databases in a standardized way.
Faced with such a vast ecosystem, even OpenClaw, which primarily focuses on native command-line operation, has integrated an adapter to connect to MCP, gaining access to a broader range of tool capabilities.
However, as AI's "hands" reach further, danger follows. What if the tool the Agent calls has been poisoned by a hacker? What if the error message returned by the tool contains hidden malicious instructions?
When the LLM unsuspectingly executes these instructions, your private data, local files, and even server permissions become easy prey for hackers.
To fill the gap in security assessment for the MCP ecosystem, a research team from institutions including Beijing University of Posts and Telecommunications introduced a dedicated security benchmark for the MCP protocol: MSB (MCP Security Bench). The study found: Attacks targeting each stage of the MCP process are effective. The more powerful the model, the more vulnerable it is to attacks. The paper has been accepted by ICLR 2026.
Paper link: https://openreview.net/pdf?id=irxxkFMrry
Code: https://github.com/dongsenzhang/MSB
MCP Security Risks Behind Agents
Figure 1: MCP Attack Framework
MCP significantly expands the capabilities of Agents, but it also significantly broadens the attack surface. Under the MCP framework, the Agent's tool invocation process typically involves three stages:
1. Task Planning: The Agent selects an appropriate tool based on the user query, using the tool's name and description.
2. Tool Invocation: The Agent sends a request to the selected tool, passing relevant parameters to perform specific operations.
3. Response Handling: The Agent parses the tool's response and continues reasoning or generates a final answer based on it.
Each stage can become a new attack vector. MSB covers the complete MCP tool invocation stages and is specifically designed to evaluate the security of Agents based on MCP tool usage. It has three core highlights:
MCP Attack Taxonomy
In the MCP workflow, Agents interact with tools through tool identifiers (names and descriptions), parameters, and tool responses, all of which can become attack pathways. MSB classifies attack types based on these pathways and interaction stages:
Tool Signature Attack: Attacks during the task planning phase, utilizing tool names and descriptions, including:
Name Collision (NC): Creating a malicious tool with a name similar to an official tool to induce the Agent to select it.
Preference Manipulation (PM): Injecting promotional statements into the tool description to induce the Agent to select it.
Prompt Injection (PI): Injecting malicious instructions into the tool description.
Tool Parameter Attack: Attacks during the tool invocation phase, utilizing tool parameters, including:
Out-of-Scope Parameter (OP): Setting tool parameters that exceed normal functionality, potentially causing information leakage through parameter passing.
Tool Response Attack: Attacks during the response handling phase, utilizing tool responses, including:
User Impersonation (UI): Impersonating the user to issue malicious instructions.
False Error (FE): Providing false tool execution error information, requiring the Agent to follow malicious instructions to successfully call the tool.
Tool Transfer (TT): Instructing the Agent to call a malicious tool.
Retrieval Injection Attack: Attacks during the response handling phase, utilizing external resources, including:
Retrieval Injection (RI): External resources embedding malicious instructions that corrupt the context through the tool response.
Mixed Attack: Attacks across multiple stages, simultaneously utilizing multiple tool components, including combinations of the above attacks.
Execution Suite in Real Environments
MSB rejects paper-based simulated evaluation. It is equipped with real MCP servers, covering 10 real-world scenarios, 405 real tools, and 2,000 attack instances. All instances involve real tool execution through MCP, accurately reflecting actual operating environments to directly observe the extent of environmental damage caused by attacks.
NRP Metric Balancing Performance and Security
In Agent security assessment, relying solely on the Attack Success Rate (ASR) is highly deceptive. If an Agent refuses to execute any tool calls to avoid risk, its ASR might be close to 0, but it would also fail to complete user tasks, losing practical value.
To address this, MSB proposes the Net Resilient Performance (NRP) metric:
NRP = PUA ⋅ (1 − ASR)
Where PUA (Performance Under Attack) is the proportion of user tasks the Agent completes in an adversarial environment, and ASR is the attack success rate. NRP aims to assess the overall risk resilience of an Agent in resisting attacks while maintaining performance, providing a comprehensive quantitative standard that balances performance and security.
Figure 2: NRP vs ASR, NRP vs PUA.
All Attack Methods Are Effective
Figure 3: Main experimental results.
The research team conducted large-scale tests on 10 mainstream models, including GPT-5, DeepSeek-V3.1, Claude 4 Sonnet, and Qwen3, using MSB. All attack methods demonstrated effectiveness, with an overall average ASR of 40.35%. Among them, novel attacks introduced by MCP are more aggressive; compared to PI and RI attacks that already exist in function calling, MCP-based attacks like UI and FE have higher success rates. Mixed attacks show synergistic enhancement, with their success rates higher than those of their constituent single attacks.
More Powerful Models Are More Vulnerable
The relationship between different metrics reveals a counterintuitive conclusion: the more capable the model, the more vulnerable it tends to be.
Figure 4: PUA vs ASR.
In MSB, completing attack tasks still requires the Agent to call tools, such as using a file read tool to obtain personal information. LLMs with higher utility, due to their superior tool calling and instruction-following capabilities, exhibit higher ASR. This finding highlights the significant practical risk of MCP security vulnerabilities.
Full-Stage, Multi-Tool Environmental Compromise
Figure 5: ASR across different stages and tool configurations.
Further analysis from the perspective of the MCP workflow and tool configuration reveals that Agents are vulnerable to attacks at all stages of MCP, with security being lowest during the tool invocation phase.
Furthermore, attacks remain effective even in multi-tool environments containing harmless tools. Real-world scenarios often provide Agents with a toolkit; even if harmless tools are present,诱导 methods like NC, PM, and TT can still lead to significant attack success.
Conclusion
The viral success of OpenClaw has given people a clear glimpse into the future of Agents: LLMs are no longer just answering questions but are starting to actually do things. MSB was proposed precisely in this context. It systematically reveals potential attack surfaces within the MCP ecosystem and provides a reproducible, quantifiable systematic evaluation benchmark for Agent security research.
Past LLM security research primarily focused on linguistic-level risks like prompt injection. MSB demonstrates that as AI calls tools and interacts with real systems, the attack surface is expanding from text space to the tool ecosystem. As Agents gradually become the new paradigm for AI applications, security might be the threshold that must be crossed for this technological leap.
References:
https://openreview.net/pdf?id=irxxkFMrry
This article is from the WeChat public account "新智元" (New Zhiyuan), author: 新智元











