New AMD Paper Upends Conventional Wisdom: FP4 Training Instability Is Not Caused by Insufficient Randomness

marsbit | Published on 2026-05-27 | Last updated on 2026-05-27

Abstract

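New research from AMD finds that the main cause of FP4 training instability is not insufficient randomness, as previously believed, but structured microscaling error that accumulates and is amplified along critical gradient paths. Past attempts to train large models from scratch in FP4 have often failed because training was unstable. The paper, from AMD and Pennsylvania State University, shows experimentally that applying FP4 quantization on the weight-gradient path of the Transformer significantly degrades convergence. Randomness-based strategies previously used to mitigate quantization error, such as stochastic rounding, made the instability worse in this setting. The team uses the MXFP4 data format with a deterministic Hadamard rotation as the stabilizer, and completes full FP4 pretraining of Llama 3.1-8B on AMD MI355X GPUs. With only 8-9% extra training-token overhead, the method trains 9-10% faster end to end than the FP8 baseline. The study is the first to validate low-precision training on native FP4 hardware, offers a new direction for cutting large-model training cost, and argues that analyzing structured error matters more than adding randomness. Because MXFP4 is based on an open OCP standard, the approach is also more portable across hardware platforms.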

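It is well known that training large models is extremely expensive.

It is equally well known that lowering training precision can cut that cost substantially. DeepSeek-V3 used FP8 training to bring its cost down to 5.6 million US dollars, which already caught the whole industry's attention.

After the success of FP8, the industry has kept probing the limits of low precision: going from FP8 to FP4, how much further can training cost fall?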

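In theory, FP4 compute throughput can be twice that of FP8. Both NVIDIA Blackwell and the AMD MI350 series support FP4 arithmetic natively in hardware; the former lists FP4 compute of up to 4500 TOPS (sparse) on the B200. The hardware is ready, but on the software and algorithm side one problem has remained stuck:

Training a large model from scratch in FP4 is highly unstable.

Over the past two years, work such as LLM-FP4 and NVFP4 pretraining has tried this route, but few approaches have cleanly completed full pretraining at 4-bit precision while keeping convergence quality close to FP8.

Worse, the cause of the collapse was never clear. Earlier analyses held that FP4 training instability most likely came from insufficient randomness.

Recently, however, AMD and Pennsylvania State University released a paper that overturns this view and offers a new, clear diagnosis for native FP4 training.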

Paper title: Pretraining large language models with MXFP4 on Native FP4 Hardware
Paper link: https://arxiv.org/abs/2605.09825

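The paper completes full pretraining of Llama 3.1-8B in the MXFP4 format on AMD Instinct MI355X GPUs. End-to-end training is 9-10% faster than the FP8 baseline, with only 8-9% more tokens. This is currently the first complete experiment to pretrain a large model on native FP4 hardware rather than in software emulation.

More importantly, the paper identifies the core problem: FP4 training instability does not come from insufficient randomness, but from structured microscaling error that accumulates and is amplified along sensitive gradient paths.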

What Is MXFP4

Before unpacking the paper, it helps to understand the MXFP4 data format.

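Conventional integer quantization usually applies a single scale factor to the whole tensor. The core design of MXFP4 is microscaling: a tensor is cut into small blocks (for example, groups of 32 elements), each block is assigned one shared exponent (in E8M0 format), and each element within the block is stored as a 4-bit floating-point number. The reconstruction formula can be written as:

x̂_i = 2^(E_shared) · Q_FP4(x_i / 2^(E_shared))

where E_shared is the largest exponent in the block and Q_FP4 denotes round-to-nearest onto the values representable in 4-bit floating point.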

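The benefit of microscaling is that each block has its own dynamic range and is not held hostage by global outliers. This makes 4-bit floating-point representation much more accurate than naive global quantization.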
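To make the block-wise scheme concrete, here is a minimal NumPy sketch of an MXFP4-style quantize-dequantize round trip. It is an illustration under stated assumptions, not the paper's kernel: the function name `mxfp4_quantize` is hypothetical, the FP4 element is assumed to be E2M1 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), and the shared exponent is taken as the block's largest exponent minus the FP4 maximum exponent, which is the usual OCP Microscaling convention. Rounding ties go to the smaller magnitude here, which may differ from hardware behavior.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (assumed element format; sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_EMAX = 2  # exponent of the largest FP4 magnitude: 6 = 1.5 * 2**2


def mxfp4_quantize(x, block=32):
    """Quantize then dequantize a 1-D array using one power-of-two scale per block."""
    x = np.asarray(x, dtype=np.float64)
    assert x.ndim == 1 and x.size % block == 0
    blocks = x.reshape(-1, block)

    # Largest magnitude per block decides the shared exponent.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    e_shared = np.floor(np.log2(safe)) - FP4_EMAX
    scale = 2.0 ** e_shared

    # Scale into FP4 range, saturate at 6, and snap to the nearest grid point.
    scaled = blocks / scale
    mag = np.clip(np.abs(scaled), 0.0, FP4_GRID[-1])
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]

    # Dequantize: x_hat = 2**E_shared * Q_FP4(x / 2**E_shared).
    return (q * scale).reshape(x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)
    x[5] = 1000.0  # an outlier that lives only in the first block
    err = np.abs(mxfp4_quantize(x) - x)
    print("max error, block with outlier:   ", err[:32].max())
    print("max error, block without outlier:", err[32:].max())
```

Because each block of 32 picks its own power-of-two scale, the outlier in the first block does not coarsen the grid used by the second block.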

But even with microscaling, FP4 training remains unstable.

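Ablation: Locating the Root of the Instability

The team first designed a step-by-step controlled experiment.

A full Transformer linear-layer computation involves three general matrix multiplications (GEMMs):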

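Fprop (forward pass): computes Y = XW^T, producing the activations

Dgrad (activation gradient): computes ∇X = ∇Y · W, passing the gradient back to the input

Wgrad (weight gradient): computes ∇W = (∇Y)^T · X, producing the gradient used to update the weights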
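The three GEMMs above can be written out directly. The following NumPy sketch uses small made-up shapes and a random stand-in for the upstream gradient ∇Y; the function names `fprop`, `dgrad`, and `wgrad` are chosen here for illustration and are not taken from the paper's code. Everything runs in full precision, so it only shows which operands enter which product.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out = 4, 8, 6          # illustrative sizes only
X = rng.standard_normal((tokens, d_in))
W = rng.standard_normal((d_out, d_in))


def fprop(X, W):
    """Forward pass: Y = X W^T, shape (tokens, d_out)."""
    return X @ W.T


def dgrad(dY, W):
    """Activation gradient: dX = dY W, shape (tokens, d_in)."""
    return dY @ W


def wgrad(dY, X):
    """Weight gradient: dW = dY^T X, shape (d_out, d_in)."""
    return dY.T @ X


Y = fprop(X, W)
dY = rng.standard_normal(Y.shape)      # stand-in for the gradient arriving from the next layer
dX = dgrad(dY, W)
dW = wgrad(dY, X)

print(Y.shape, dX.shape, dW.shape)
```

Note that Wgrad is the only one of the three that contracts over the token dimension, so every token in the batch contributes to each entry of ∇W.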

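Holding everything else fixed, the team switched these three operations from FP8 to MXFP4 one at a time and observed how each step affected convergence. All experiments ran on AMD Instinct MI355X with native FP4 tensor cores, with no software emulation.

The training task is the MLPerf standard setup: pretraining Llama 3.1-8B on the C4 dataset, with a convergence target of 3.3 validation perplexity.

The first two steps added only a modest extra token overhead, but once Wgrad was also switched to MXFP4, the overhead jumped straight to 26-27%.

Wgrad is the bottleneck of FP4 training. The forward pass and the activation gradient tolerate FP4 quantization fairly well, but once the weight gradient is quantized to 4 bits, convergence quality degrades markedly.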

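The prevailing intuition in the field had been that FP4 quantization error is essentially a noise problem, so it can be "smoothed" by injecting randomness. Two common strategies are:

Stochastic rounding: introduce randomness at quantization time so that the rounding error has zero expectation

Randomized Hadamard rotation: before quantization, scramble the data distribution with a Hadamard transform combined with random sign flips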
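As a reference point for the first strategy, the sketch below shows what "zero expected rounding error" means on a 4-bit grid. It is a minimal illustration, assuming the FP4 E2M1 magnitude grid and inputs already scaled into [0, 6]; the helper name `stochastic_round` is hypothetical and this is not the paper's implementation.

```python
import numpy as np

# Assumed FP4 E2M1 magnitudes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def stochastic_round(mag, rng):
    """Round magnitudes in [0, 6] to one of the two neighbouring grid points.

    The upper neighbour is chosen with probability proportional to how close
    the input is to it, so the expected output equals the input.
    """
    mag = np.asarray(mag, dtype=np.float64)
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag, side="left"), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    return np.where(rng.random(mag.shape) < p_up, hi, lo)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = stochastic_round(np.full(100_000, 0.7), rng)
    print("values produced:", np.unique(samples))   # only the neighbours 0.5 and 1.0
    print("empirical mean: ", samples.mean())       # close to 0.7
```

Round-to-nearest would map 0.7 to 0.5 every time. Stochastic rounding removes that bias on average, but each individual draw lands on a different grid point, so the error pattern changes from step to step.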

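Once Wgrad was quantized, neither randomness-based strategy stabilized training; both led directly to non-convergence. Far from helping, randomness introduced more effective quantization error on the critical gradient path.

By contrast, deterministic Hadamard rotation brought the full-pipeline token overhead from 26-27% back down to 8-9% in one step, with the training trajectory closely tracking the FP8 baseline.

This is a highly diagnostic result. Randomized and deterministic Hadamard rotations are both orthogonal transforms and both spread out the energy of outliers, so in theory they should mitigate quantization error to a similar degree. Yet in the Wgrad setting they behave in opposite ways, which exposes the nature of the problem: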

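FP4 training instability is driven by the structured error that MXFP4 microscaling produces on sensitive gradient paths. Randomness-based strategies fail because they introduce a different error pattern at every step, and these shifting patterns accumulate along the gradient path and amplify the instability. Deterministic rotation works precisely because it applies the same transform at every step, keeping the error pattern consistent and avoiding that accumulation.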
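The following NumPy sketch illustrates the two rotation variants in exact arithmetic. It assumes a 16×16 Hadamard matrix (the H16 mentioned in the figure caption) applied block-wise along the token dimension, which is the dimension Wgrad contracts over; that placement, and the helper names `hadamard` and `rotate_blocks`, are assumptions made for illustration. No quantization is performed, so the sketch only shows why both rotations leave ∇W unchanged mathematically and how they differ from step to step. It does not reproduce the paper's stability result.

```python
import numpy as np


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (n a power of two), Sylvester construction."""
    assert n > 0 and n & (n - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def rotate_blocks(M, R):
    """Apply R to every consecutive group of R.shape[0] rows of M."""
    k = R.shape[0]
    assert M.shape[0] % k == 0
    return (R @ M.reshape(-1, k, M.shape[1])).reshape(M.shape)


H16 = hadamard(16)
rng = np.random.default_rng(0)
tokens, d_in, d_out = 64, 8, 6             # illustrative sizes only
X = rng.standard_normal((tokens, d_in))
dY = rng.standard_normal((tokens, d_out))

# Deterministic variant: the same H16 at every training step.
dW_det = rotate_blocks(dY, H16).T @ rotate_blocks(X, H16)

# Randomized variant: H16 combined with sign flips that are redrawn each step.
signs = rng.choice([-1.0, 1.0], size=16)
R = H16 * signs                            # equals H16 @ diag(signs), still orthogonal
dW_rand = rotate_blocks(dY, R).T @ rotate_blocks(X, R)

# A single spike is spread evenly across all 16 positions by the rotation.
spike = np.zeros(16)
spike[3] = 8.0
print(np.abs(H16 @ spike))
```

Since R^T R = I in both cases, the two products equal (∇Y)^T X before any quantization is applied. In a real FP4 pipeline the rotated operands would be quantized before the GEMM, and that is where a fixed rotation and a per-step random one can diverge.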

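End-to-End Efficiency: +20% Step Throughput, 9-10% Overall Speedup

With deterministic Hadamard rotation added to the full MXFP4 pipeline, the efficiency figures are as follows:

Training-step throughput rises by 20%. After subtracting the extra 8-9% token overhead, the end-to-end speedup is still 9-10%.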
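One way to see how these numbers fit together is a quick back-of-the-envelope check. The sketch below assumes wall-clock time is simply tokens needed divided by tokens processed per second, and reads "speedup" as the fraction of wall-clock time saved; the paper may define the figure differently, and the function name `wall_clock_saving` is hypothetical.

```python
def wall_clock_saving(throughput_gain, token_overhead):
    """Fraction of wall-clock time saved versus the baseline.

    Assumes time = tokens_needed / tokens_per_second, so the relative time is
    (1 + token_overhead) / (1 + throughput_gain).
    """
    relative_time = (1.0 + token_overhead) / (1.0 + throughput_gain)
    return 1.0 - relative_time


if __name__ == "__main__":
    for overhead in (0.08, 0.09):
        saving = wall_clock_saving(0.20, overhead)
        print(f"token overhead {overhead:.0%}: time saved {saving:.1%}")
```

Under this reading, a 20% throughput gain with 8% and 9% extra tokens gives 10.0% and about 9.2% less wall-clock time, which matches the reported 9-10% range.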

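Given that precision is cut straight from 8 bits to 4 bits, both the convergence quality and the size of the speedup are considerable.

Left: validation perplexity versus training tokens for Llama 3.1-8B during MLPerf pretraining on the C4 dataset. MXFP4 with deterministic Hadamard tracks FP8 very closely, while the full MXFP4 pipeline without stabilization converges more slowly and trains less stably. Right: zoomed-in view of late training. The MLPerf target perplexity is 3.3. Compared with the unstabilized MXFP4 run, deterministic Hadamard (H16) stays in much closer agreement with the FP8 baseline.

It is worth noting that the authors explicitly stress an important limitation: this FP4 training recipe has been validated for one setup (the MLPerf C4 dataset with Llama 3.1-8B), but it cannot be assumed to transfer seamlessly to all models, datasets, and training methods. FP4 training behavior may be highly setup-dependent, and the specific stabilization strategy needs to be revalidated for each scenario.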

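Conclusion

Placed in the broader industry context, the paper matters on at least three levels.

First, it answers a fundamental "why". Past FP4 training work mostly focused on how to keep training from collapsing. This paper gives the first clear causal diagnosis: the collapse comes from structured microscaling error on the Wgrad path, not from insufficient randomness. The diagnosis has methodological value in its own right. It tells later researchers that when low-precision training becomes unstable, they should first look for structured error sources rather than blindly adding randomness.

Second, it moves FP4 from "inference only" toward "usable for training". The prior industry consensus was that FP4 is suitable only for inference quantization and that training needs at least FP8. NVIDIA's emphasis on FP4 for inference rather than training on Blackwell reflects the same judgment. By completing full pretraining on native FP4 hardware, the paper implies that the FP4 compute on MI355X and Blackwell that was provisioned for inference could in principle also be used for training. If FP4 training is validated on larger models and in more scenarios, the usable training compute of existing hardware would effectively double.

Third, it uses an open OCP standard. MXFP4 is part of the OCP Microscaling format standard, which is jointly backed by seven companies: AMD, NVIDIA, Intel, Meta, Microsoft, Arm, and Qualcomm. Building on an open standard means the method is portable across hardware from different vendors and is not locked into a single ecosystem.

From FP16 to FP8, DeepSeek-V3 has already shown that halving precision can sharply reduce training cost. From FP8 to FP4, this paper takes a key first step. With each cut in precision, the economics of large-model training shift.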

This article comes from the WeChat official account "机器之心" (ID: almosthuman2014). Editor: 冷猫
