The Secret Behind Chinese AI's Remarkable Cost-Effectiveness, Punctured by a Single Blog Post

marsbit · Published on 2026-05-07 · Last updated on 2026-05-07

On the first trading day after the May Day holiday, Zhipu and MiniMax both surged.

On May 4, Zhipu rose more than 10%, with its share price once again approaching the 1,000-yuan mark, while MiniMax jumped 12.62% to close at HK$803.

According to a Morgan Stanley report, the rally was driven by a "cost-effectiveness narrative" unique to Chinese AI.

In its report "China's AI Path: More Bang For The Buck," Morgan Stanley argues that, even under compute constraints, the intelligence of top Chinese and US models is converging rapidly, with the gap narrowing to three to six months.

The report also notes that what truly sets Chinese models apart is their ability to deliver near-equivalent intelligence at 15% to 20% of the inference cost of their US counterparts.

This is easy to understand: not everyone needs the strongest model, but almost everyone wants a cheap one.

What the market is buying is not a simple "domestic substitution" story, but the fact that Chinese AI is converting cost-effectiveness into real call volume, real revenue, and real valuation upside.

But this raises a question: where does the cost-effectiveness actually come from?

If it were merely low prices to win customers, it would quickly devolve into a price war.

If it were merely model distillation — Anthropic, OpenAI, and others have all closed off distillation access, so shouldn't ratings have been cut rather than raised?

What actually made the narrative more convincing was a technical blog Zhipu published just before the May Day holiday: "Scaling Pain: Inference Practice for Ultra-Large-Scale Coding Agents."

The post didn't pitch a grand AGI vision. Instead, it laid out the low-level engineering — KV Cache, throughput, scheduling, anomalous outputs — for the market to inspect.

Most importantly, it punctured the secret behind Chinese AI's cost-effectiveness.

01

In the post, Zhipu explains, in essence, how optimizing caching, scheduling, and anomaly monitoring lets the same GPUs do more work with fewer errors.

Zhipu found that when AI misbehaves, the model isn't necessarily the problem; the backend serving system may simply be a mess. The team fixed a bug where cached data leaked across requests, optimized GPU scheduling and cache reuse, and added an alarm that catches anomalous outputs early.

The result: the same model on the same GPUs can serve more users with a lower error rate. So the "cost-effectiveness narrative" isn't simple price-cutting; it's engineering work squeezing more stable, usable compute out of every GPU.

After these low-level optimizations, the GLM-5 series achieved up to a 132% increase in system throughput in Coding Agent scenarios, while the system's anomalous-output rate fell from roughly 10 in 10,000 to 3 in 10,000.

For example, a GPU that previously served 100 tasks per hour can now serve up to 232.

No single item decides the outcome on its own. But stacked together, they amount to more than double the throughput on the same compute, plus a better-than-threefold drop in the anomaly rate.

The model didn't change. What changed is how the model is put to use.

Specifically, starting in March, Zhipu observed three classes of anomalies in GLM-5's online monitoring and user feedback: garbled text, repetition loops, and rare-character output. On the surface these resembled the "intelligence degradation" commonly seen in long-context scenarios.

But the team had not shipped any optimization that reduced model precision. So did the anomalies originate in the model itself, or in the inference pipeline?

After repeatedly analyzing inference logs, they found an unexpected angle: speculative-decoding metrics could serve as a reference signal for anomaly detection.

Speculative decoding was originally just a performance optimization: a draft model generates candidate tokens, and the target model verifies them and decides whether to accept, speeding up decoding without changing the final output distribution.

In plain terms, a small model quickly drafts a batch of tokens, and the large model keeps only the correct ones — fast without losing accuracy.
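The draft-then-verify loop can be sketched as follows. This is a toy illustration only — the "models" are random stand-ins, and real speculative decoding (as in SGLang or similar engines) compares draft and target token probabilities so the final output distribution exactly matches the target model:

```python
import random

def draft_model(prefix, k):
    """Stand-in for a small, fast draft model: propose k candidate tokens."""
    return [random.choice("abcde") for _ in range(k)]

def target_model_accepts(prefix, token):
    """Stand-in for the large target model's verification of one token."""
    return random.random() < 0.7  # toy acceptance probability

def speculative_step(prefix, k=4):
    """One speculative-decoding step: draft k tokens, keep the accepted run.

    The first rejected token ends the step; in a real engine the target
    model then resamples that position itself, so correctness is preserved
    while most tokens are produced at draft-model speed.
    """
    candidates = draft_model(prefix, k)
    accepted = []
    for tok in candidates:
        if target_model_accepts(prefix + "".join(accepted), tok):
            accepted.append(tok)
        else:
            break
    return accepted

random.seed(0)  # deterministic toy run
tokens = speculative_step("hello ", k=4)
```

The average number of tokens accepted per step is exactly the kind of statistic (`spec_accept_length`) that Zhipu later repurposed as a health signal.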

The team noticed that whenever anomalies occurred, two speculative-decoding metrics showed a stable pattern. So they extended speculative decoding from a pure performance optimization into a real-time monitoring signal for output quality.

When spec_accept_length stays below 1.4 while the generation has already exceeded 128 tokens, or when spec_accept_rate exceeds 0.96, the system proactively aborts the current generation and hands the request back to the load balancer for a retry.

These two numbers are like vital signs in a checkup: when they go out of range, the model is "sick" and needs a restart.

Users never notice the process, but behind the scenes the system has completed exactly such a restart.
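The abort rule above can be sketched as a small predicate. The thresholds (1.4, 128 tokens, 0.96) are the ones stated in the blog; the function and parameter names are hypothetical:

```python
def should_abort(spec_accept_length, spec_accept_rate, tokens_generated,
                 low_len_threshold=1.4, min_tokens=128,
                 high_rate_threshold=0.96):
    """Decide whether to abort a generation and retry elsewhere.

    - A persistently low accept length, once the output is long enough,
      suggests degenerate output (garbled text, repetition).
    - An abnormally high accept rate means draft and target agree almost
      perfectly, a pattern associated with repetition loops.
    """
    if spec_accept_length < low_len_threshold and tokens_generated > min_tokens:
        return True
    if spec_accept_rate > high_rate_threshold:
        return True
    return False
```

On a `True` result the serving layer would cancel the request and resubmit it through the load balancer, which is the "restart" the article describes.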

The root cause of the anomalies was a KV Cache reuse conflict.

Think of a kitchen at the peak of the dinner rush, with many customers ordering at once.

The system has to temporarily store each user's context — the KV Cache: what this table ordered, less chili, no cilantro. With one or two customers that's manageable; with a crowd, the waiter starts mixing up orders.

Under high concurrency, the order in which certain cache blocks were reclaimed, reused, and read got scrambled. The model picked up the wrong context and could emit garbled text, repetition, or rare characters.

In the inference engine, under a prefill-decode (PD) disaggregated architecture, the request lifecycle and the timing of KV Cache reclamation and reuse can fall out of sync. Under heavy concurrency the conflict is amplified, and on the user side it shows up as garbled output and repetition.

Multiple requests end up contending for the same block of memory, the data gets corrupted, and what the user sees is gibberish.

Zhipu's team located the bug and fixed it.

Beyond that, at the source-code level of SGLang, a mainstream open-source inference framework, they found and fixed a missing load-ordering guarantee in the HiCache module — a read-before-ready race.

The fix was submitted to the SGLang community via Pull Request #22811 and accepted.
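Illustratively, a read-before-ready race is what happens when a reader touches a cache block before an asynchronous load into it has finished. The toy class below is hypothetical — it is not SGLang's actual HiCache code — but it shows the shape of the fix: gate every read on an explicit ready signal instead of assuming the load has completed:

```python
import threading
import time

class CacheBlock:
    """Toy KV-cache block with an explicit ready flag."""

    def __init__(self):
        self.data = None
        self.ready = threading.Event()

    def load(self, data):
        time.sleep(0.01)      # simulate an asynchronous load from host memory
        self.data = data
        self.ready.set()      # only now is the block safe to read

    def read(self):
        # The fix for read-before-ready: block until the load completes.
        # Without this wait, a concurrent reader could observe None or
        # stale data -- the class of bug the article describes.
        self.ready.wait()
        return self.data

block = CacheBlock()
threading.Thread(target=block.load, args=("ctx:user-42",)).start()
assert block.read() == "ctx:user-42"  # waits for the loader, never sees None
```

In a real engine the same discipline applies to reclamation: a block must not be reused for another request while an earlier reader can still reach it.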

SGLang is an open-source project — an inference and serving framework for large language models. It is not a model and not an AI company, but foundational software that makes large models run efficiently.

While running on SGLang, Zhipu discovered this high-concurrency cache bug.

Rather than patching it only internally, Zhipu contributed the fix back to the SGLang project.

The maintainers reviewed, accepted, and merged it. The fix entered the public release, where every other developer and company using SGLang can benefit from it.

What does that mean in practice?

If some deployment path of Qwen uses SGLang with HiCache, then Alibaba also benefits from the problem Zhipu found and fixed.

Which brings us back to the same point: the model didn't change, but engineering made it smarter in use.

02

What Zhipu's post really punctures is something deeper.

The cheapness of the chatbot era came largely from low training costs, with part of the training data distilled from frontier models.

In the agent era, that trick no longer works.

This year, Anthropic and OpenAI have successively closed off distillation access, explicitly prohibiting the use of their model outputs to train competing models. The distillation shortcut is narrowing.

Yet the cost-effectiveness narrative of Chinese AI companies hasn't faded; the market is instead doubling down on it.

The reason: the definition of cost-effectiveness has changed.

Chatbot era: average context of 55K tokens, single-turn conversations, low concurrency.

Agent era: average context of 70K+ tokens, long-running tasks (on the order of 8 hours), high concurrency, heavy prefix reuse.

In the chatbot era, the unit of AI cost-effectiveness was simple: for the same question, whose model was cheaper, and whose answer came closest to frontier quality.

The industry debated price per million tokens, parameter counts, and benchmark scores.

In the agent era, nobody asks that. The old arithmetic has broken down.

Users are no longer buying a single answer. They are buying the completed result of an entire task.

A coding agent has to read code, understand context, plan steps, call tools, modify files, run tests, and retry on failure. The tokens it consumes aren't the increment of one Q&A turn but the running bill of a whole workflow.

OpenRouter, the world's largest model-routing platform, saw its weekly token volume grow from 6.4 trillion in the first week of January 2026 to 13 trillion in the week of February 9 — doubling in a month.

OpenRouter's own explanation: the incremental demand in the 100K-to-1M long-context range is precisely the typical consumption pattern of agent workflows.

Usage has switched from "conversational" to "workflow-based." Accordingly, the unit of AI cost-effectiveness has shifted from "price per token" to "price per task."

As a result, some models have cheap tokens but weak capabilities: their tasks keep failing midway or come back below spec, so their price per completed agent task isn't cheap at all.

For example, in an 8-hour coding task, a single bout of garbled output can force the whole workflow to restart. The savings on token price can't buy back the wasted time.
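The "price per task" arithmetic can be made concrete. Under a simple retry model — all numbers here are hypothetical, for illustration only — a model that is cheaper per attempt but fails more often can end up more expensive per completed task:

```python
def expected_cost_per_completed_task(cost_per_attempt, success_rate):
    """Expected cost of one *completed* task when failed attempts are retried.

    With independent retries, on average 1 / success_rate attempts are
    needed per success, so the expected cost is cost_per_attempt divided
    by the success rate. Illustrative model, not from the article.
    """
    return cost_per_attempt / success_rate

cheap_but_flaky = expected_cost_per_completed_task(1.0, 0.5)    # $2.00 per completed task
pricier_reliable = expected_cost_per_completed_task(1.5, 0.95)  # ~$1.58 per completed task
```

This is why stability improvements like Zhipu's anomaly-rate reduction feed directly into the per-task price, not just user experience.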

Chinese AI's cost-effectiveness narrative is upgrading.

It used to be "same-quality answers, but cheaper." Now it's "same-complexity tasks, completed at lower cost."

Open-source infrastructure is also becoming a new moat for Chinese AI.

SGLang, mentioned above, is a case in point. Chinese AI's engineering capability is beginning to radiate upstream into the community.

The value isn't just that Zhipu fixed one bug. It's that Chinese AI companies are feeding the high-concurrency, long-context, agent-workload problems of real production back into public infrastructure.

As noted above, once a fix lands in an open-source framework like SGLang, it no longer serves only Zhipu's models. Every team deploying large models on that framework stands to gain more stable caching, lower inference costs, and a better agent experience.

Model capability can be caught up to, and prices can be undercut. But once infrastructure enters the open-source ecosystem, it hardens into standards, interfaces, and developer habits.

Whoever writes their engineering experience into these foundational systems earliest will find it easiest to hold a position in the next wave of AI applications.

03

Back to the capital markets.

AI model stocks are rallying across the board. Why is capital willing to reprice AI companies? What exactly is the market buying?

The answer: the market is paying for the narrative that Chinese AI companies can deliver near-frontier intelligence at a far lower inference cost.

Again, take OpenRouter's data.

The token-consumption share of leading Chinese AI companies climbed rapidly from 5% in April 2025 to 32% in March 2026. The share of leading US models fell sharply from 58% to 19%.

Token usage for MiniMax, Zhipu, and Alibaba grew 4-6x in February-March 2026 compared with last December.

Beyond token volume, Chinese AI is also forming a growth logic entirely different from that of the overseas giants.

Overseas frontier models sell a "capability premium."

The stronger the model, the pricier each call; users pay for the strongest intelligence. Claude, GPT-5, and Gemini are all heading in that direction.

Chinese AI sells "engineering."

Capability close to the frontier, but with lower prices, lower latency, and lower barriers to calling — a better fit for the vast majority of high-frequency scenarios.

Morgan Stanley's report notes that Chinese models' input price is about $0.30 per million tokens, while some comparable overseas products sit around $5 — a gap of well over tenfold.

As AI shifts from novelty to productivity tool, cost-effectiveness directly determines call frequency.

If the model gets a bit cheaper, enterprises dare to hand over more customer-service, coding, marketing, and data-analysis tasks. More tasks mean more token consumption, which lets platforms spread infrastructure costs ever thinner.

I believe a flywheel can form at this point.

First turn: attract developers and enterprises with lower API prices and near-frontier capability.

Second turn: higher call volume brings more real-world scenarios, forcing the model and the inference system to keep optimizing.

Third turn — exactly what Zhipu's technical blog describes: use engineering to cut per-token and per-task costs, giving vendors room to keep cutting prices and growing volume, or to raise prices in high-value scenarios.

Fourth turn: as token consumption becomes the new traffic of the AI era, whoever can carry more tokens at lower cost gets closest to being the platform company of the next stage.

If it were just model price cuts, the market would worry about subsidies and a price war — an ever-deepening cash burn that someone's wallet eventually can't sustain.

And a price war can't support a high valuation.

But if the price cuts are backed by throughput gains, cache reuse, falling error rates, and scheduling efficiency, then low prices aren't profit sacrificed for growth; they're cost headroom released by engineering capability.

A price war and this kind of engineering optimization both make models cheaper, and on a financial statement they may look much the same. In a valuation model, they are worlds apart.

The former is a subsidy, which the market discounts. The latter is an engineering moat, which the market pays a premium for.

Which leads to a final judgment.

In the past, AI company valuations tracked the ceiling of model capability — who was closest to AGI. The market was paying for "the strongest intelligence," even as its definition grew ever blurrier and each call grew more expensive.

In the agent era, valuation also tracks the cost floor: who can deliver intelligence stably, cheaply, and at scale.

Chasing the most cutting-edge "intelligence" may not be what Chinese AI does best.

But Chinese AI is the most likely to turn "intelligence" into infrastructure that every person and every company can afford.

And the market only pays for companies that can articulate their own logic.

This article comes from the WeChat official account 字母榜 (ID: wujicaijing); author: 苗正.
