Turing Award Laureate Sutton's New Work: Using a Formula from 1967 to Solve a Major Flaw in Streaming Reinforcement Learning

marsbit · Published 2026-05-10 · Updated 2026-05-10

Article Summary

New research titled "Intentional Updates for Streaming Reinforcement Learning" (arXiv:2604.19033v1), involving Turing Award laureate Richard Sutton, addresses a core challenge in deep reinforcement learning (RL): the "stream barrier." Current deep RL methods typically rely on replay buffers and batch training for stability, failing catastrophically when learning online from single data points (streaming). The authors propose a fundamental shift: instead of prescribing how far to move parameters (a fixed step size), their "Intentional Updates" method specifies the desired change in the function's output (e.g., a 5% reduction in value prediction error). It then calculates the step size needed to achieve that intent. This idea is inspired by the Normalized Least Mean Squares (NLMS) algorithm from 1967. Applied to value and policy learning, this yields algorithms like Intentional TD(λ) and Intentional AC. The method inherently stabilizes learning by adapting the step size based on the local gradient landscape, preventing overshooting/undershooting. In experiments on MuJoCo continuous control and Atari discrete tasks, Intentional AC achieved performance rivaling batch-based algorithms like SAC in a streaming setting (batch size=1, no replay buffer), while being ~140x more computationally efficient per update. The work demonstrates significant robustness, reducing reliance on numerous stabilization tricks. A remaining challenge is bias in policy updates due to action-dependent s...

At the end of 2024, a paper titled "Streaming Deep Reinforcement Learning Finally Works" (arXiv:2410.14606) sparked widespread discussion in the academic community. The authors, from Mahmood's team at the University of Alberta, spent considerable effort describing an embarrassing reality: reinforcement learning, a method that is inherently "learn-as-you-go," has almost become incapable of doing so in the era of deep neural networks. If you simply remove the replay buffer and set the batch size to 1, training collapses. They called this the "stream barrier".

That paper proposed the StreamX series of algorithms, which barely scaled this wall through meticulous tuning of hyperparameters, sparse initialization, and various stabilization techniques.

However, less than a year and a half later, a member of the same research group, along with collaborators from the Openmind Institute, provided a distinctly different answer: the root cause of the stream barrier is not "insufficient data," but "the step size having the wrong unit."

Paper title: Intentional Updates for Streaming Reinforcement Learning

Paper link: https://arxiv.org/pdf/2604.19033v1

Code repository: https://github.com/sharifnassab/Intentional_RL

Stepping on the Gas, How Big a Hole Does It Dig?

Imagine you're learning to parallel park a car. The instructor tells you to "press the gas pedal for 0.1 seconds" each time. The problem is that the same 0.1 seconds can move the car vastly different distances depending on whether it is going uphill or downhill, empty or fully loaded. Sometimes you're off by a centimeter and park perfectly; other times you're off by 30 centimeters and hit the wall.

Traditional gradient learning step sizes do precisely this: they dictate how much the parameters should move, but exert no control over how much the function's output actually changes. In batch training, the errors of hundreds or thousands of samples are averaged, diluting extreme cases, so the problem isn't obvious. But in a "streaming" environment, where each step involves only one sample, there is no averaging. Once the gradient direction becomes unstable, the magnitude of updates can swing wildly—moving forward 30 cm today, backward 50 cm tomorrow—causing the learning process to collapse amid violent oscillations.

This phenomenon of "overshooting and undershooting" is particularly severe in reinforcement learning because the gradient at each timestep not only varies in magnitude but also changes direction rapidly.

Redefining "How Much a Step Should Do"

In a recent paper, Arsalan Sharifnassab from the Openmind Institute, along with Mohamed Elsayed, A. Rupam Mahmood, and Richard Sutton from the University of Alberta, proposed a solution from a different angle: Instead of specifying how much the parameters should move, directly specify how much the function's output should change.

This idea is not entirely new. In 1967, Japanese scholars Nagumo and Noda proposed the "Normalized Least Mean Squares" (NLMS) algorithm for adaptive filtering in their paper "A learning method for system identification." Its essence is the same: use the desired output change to deduce the step size, not the other way around. However, that algorithm was only suitable for simple linear scenarios.
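To make the contrast concrete, here is a minimal sketch of a fixed-step LMS update next to an NLMS update for a linear predictor. The function names, the fraction `eta`, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def lms_step(w, x, y, alpha=0.01):
    """Fixed-step LMS: the step size is prescribed, the output change is not."""
    error = y - w @ x
    return w + alpha * error * x

def nlms_step(w, x, y, eta=0.5, eps=1e-8):
    """NLMS: choose the step so the error shrinks by the fraction eta."""
    error = y - w @ x
    # Step size deduced from the intended output change: eta / ||x||^2.
    alpha = eta / (x @ x + eps)
    return w + alpha * error * x

# A badly scaled input (hypothetical values): ||x||^2 = 16900.
x = np.array([30.0, -40.0, 120.0])
y = 7.0
w0 = np.zeros(3)
error_before = y - w0 @ x

# Ratio of the error after the update to the error before it.
ratio_lms = (y - lms_step(w0, x, y) @ x) / error_before
ratio_nlms = (y - nlms_step(w0, x, y) @ x) / error_before

print(f"LMS  error ratio: {ratio_lms:.3f}")   # large negative: overshoot
print(f"NLMS error ratio: {ratio_nlms:.3f}")  # about 0.5, as intended
```

With the fixed step, the same `alpha` that would be harmless for a small input overshoots badly here, while the normalized step removes half of the error regardless of the input scale.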

The researchers extended this idea to deep reinforcement learning. They call it "Intentional Updates": before each update, first clarify "what I hope to achieve with this step," then deduce the step size that should be used.

For value learning (i.e., predicting future rewards), their defined intention is: after each update, the prediction error for the current state's value should shrink by a fixed proportion—for example, by 5%, no more, no less. For policy learning (i.e., optimizing decision-making actions), their defined intention is: the probability of selecting the current action is only allowed to change by a "moderate" amount each step.

Using the driving metaphor: this is like the driver deciding before each operation, "I want the car to move forward 20 cm," then automatically calculating how deep to press the gas pedal based on current road conditions (gradient, load), instead of pressing the same depth each time and leaving it to fate.

The Turing Award Laureate and His Puzzle

One of the paper's authors is Richard S. Sutton, the 2024 Turing Award laureate, widely regarded as the "father of modern reinforcement learning."

Sutton's stature in academia is roughly equivalent to that of Feynman in physics: he not only proposed Temporal Difference (TD) learning and the Policy Gradient framework, the foundations of modern reinforcement learning, but also co-authored, with Andrew Barto, the field's most authoritative textbook, "Reinforcement Learning: An Introduction" (now in its second edition, available online for free). He shared the 2024 Turing Award with Barto, with the award citation reading, "for laying the conceptual and algorithmic foundations of reinforcement learning."

After receiving the award, Sutton did not retire but instead invested the prize money into the Openmind Institute he founded, specifically funding young researchers willing to "explore fundamental problems in an environment free from commercial pressure." This new paper emerged from this non-profit institution.

And the paper's first author, Sharifnassab, had recently published the MetaOptimize framework at ICML 2025, researching how to automatically tune learning rates online. The focus of both topics is highly consistent: how to make the step size itself more intelligent.

Algorithm Details: Simpler Than Imagined

The mathematical derivation of "Intentional Updates" is not complex; its core formula can be described in one sentence: the step size equals the "desired output change" divided by the "actual influence of the gradient direction on the output."
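As a worked illustration of that one-sentence formula, consider the simplest case of a plain gradient step with no eligibility traces and no per-parameter scaling. The notation below is assumed for this sketch: v_w(s) is the value estimate, δ the prediction error, g the gradient of the estimate with respect to the parameters, and η the intended fraction of the error to remove.

```latex
\begin{aligned}
w' &= w + \alpha\,\delta\,g, \qquad g = \nabla_w v_w(s),\\
v_{w'}(s) - v_w(s) &\approx g^\top (w' - w) = \alpha\,\delta\,\lVert g\rVert^2,\\
\text{intent: } v_{w'}(s) - v_w(s) = \eta\,\delta
\;&\Longrightarrow\; \alpha = \frac{\eta}{\lVert g\rVert^2}.
\end{aligned}
```

The second line is a first-order approximation, so the intent is met exactly only for linear function approximators and approximately for neural networks.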

In value learning, this "actual influence" is determined by the norm of the gradient vector, which essentially measures how "steep" the current parameter region is. To first order, a step of size α along the gradient changes the output by α times the squared gradient norm, so step sizes are smaller in steeper areas and larger in flatter areas, ensuring the impact of each update on the value function remains consistent.
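The behavior described above can be sketched on a tiny nonlinear value function. Everything here is an assumption made for illustration (the two-layer tanh network, the hand-picked weights, the function names, and the fixed target); it is not the paper's implementation and omits eligibility traces and RMSProp scaling.

```python
import numpy as np

def value(params, s):
    """Tiny value network: v(s) = w2 . tanh(W1 s)."""
    W1, w2 = params
    return w2 @ np.tanh(W1 @ s)

def value_grad(params, s):
    """Gradient of v(s) with respect to W1 and w2."""
    W1, w2 = params
    h = np.tanh(W1 @ s)
    return np.outer(w2 * (1.0 - h ** 2), s), h

def intentional_value_step(params, s, target, eta=0.05, eps=1e-12):
    """Pick the step size so v(s) moves by about eta * (target - v(s))."""
    W1, w2 = params
    delta = target - value(params, s)
    gW1, gw2 = value_grad(params, s)
    sq_norm = np.sum(gW1 ** 2) + np.sum(gw2 ** 2)
    alpha = eta / (sq_norm + eps)            # smaller where the gradient is large
    new_params = (W1 + alpha * delta * gW1, w2 + alpha * delta * gw2)
    realized = value(new_params, s) - value(params, s)
    ratio = realized / (eta * delta)         # realized change / intended change
    return new_params, alpha, ratio

# Hand-picked weights and states (hypothetical values).
params = (np.array([[0.2, -0.1], [0.1, 0.3], [-0.2, 0.1]]),
          np.array([1.0, -0.5, 0.8]))
s_steep = np.array([2.0, 1.0])    # large input, large gradient
s_flat = np.array([0.2, 0.1])     # small input, small gradient

_, alpha_steep, ratio_steep = intentional_value_step(
    params, s_steep, target=value(params, s_steep) + 1.0)
_, alpha_flat, ratio_flat = intentional_value_step(
    params, s_flat, target=value(params, s_flat) + 1.0)

print(f"steep: alpha={alpha_steep:.5f}, realized/intended={ratio_steep:.3f}")
print(f"flat:  alpha={alpha_flat:.5f}, realized/intended={ratio_flat:.3f}")
```

The step size differs by roughly two orders of magnitude between the two states, yet in both cases the realized change in the prediction stays close to the intended 5% of the error.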

In policy learning, the "desired change" is defined to be proportional to the advantage function: how much better the current action is compared to the average determines how much the policy moves in that direction—normalized in magnitude through a running average, ensuring that over the long term, the magnitude of policy changes remains stable within an interpretable range.
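One way to read this description is sketched below for a softmax policy over a handful of discrete actions in a single state. The specific choices are assumptions for illustration only: the intent is expressed as a change in the log-probability of the sampled action, the normalizer is a running average of the absolute advantage, and the names `eta`, `beta`, and `scale` are hypothetical. The paper's exact definitions may differ.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def intentional_policy_step(theta, action, advantage, scale,
                            eta=0.05, beta=0.99, eps=1e-8):
    """Move log pi(action) by roughly eta * advantage / running_scale."""
    pi = softmax(theta)
    grad_logp = -pi                     # gradient of log pi(action) w.r.t. logits
    grad_logp[action] += 1.0
    # Running average of |advantage| keeps the intent in a stable range.
    scale = beta * scale + (1.0 - beta) * abs(advantage)
    intent = eta * advantage / (scale + eps)
    # First-order: a step c * grad changes log pi(action) by c * ||grad||^2.
    step = intent / (grad_logp @ grad_logp + eps)
    return theta + step * grad_logp, scale, intent

theta = np.zeros(4)                     # uniform policy over 4 actions
new_theta, scale, intent = intentional_policy_step(
    theta, action=0, advantage=2.0, scale=1.0)

realized = np.log(softmax(new_theta)[0]) - np.log(softmax(theta)[0])
print(f"intended change in log-prob: {intent:.4f}")
print(f"realized change in log-prob: {realized:.4f}")
```

Because the softmax is nonlinear, the realized change only approximates the intent, but for small `eta` the two stay close, which is what keeps each policy change "moderate."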

The researchers also combined this core idea with two engineering practices: RMSProp-style diagonal scaling (handling differences in magnitude across parameter dimensions) and eligibility traces (helping reward signals propagate to past timesteps).
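How these pieces could fit together is sketched below for a linear value estimate. This is one plausible combination, written under assumptions: the update direction is an eligibility trace divided elementwise by an RMSProp-style running root-mean-square, and the step size is then deduced from how strongly that direction moves the current prediction. The function name, the defaults, and the choice to track squared `delta * z` are all hypothetical and not confirmed by the source.

```python
import numpy as np

def scaled_trace_step(w, z, nu, x, delta, eta=0.05, lam=0.8, gamma=0.99,
                      beta=0.999, eps=1e-8):
    """One value update for a linear estimate v(s) = w @ x.

    z  : eligibility trace (decayed sum of past feature vectors)
    nu : running average of squared update signals (RMSProp-style)
    """
    z = gamma * lam * z + x                        # accumulate the trace
    nu = beta * nu + (1.0 - beta) * (delta * z) ** 2
    direction = z / (np.sqrt(nu) + eps)            # diagonal scaling
    influence = x @ direction                      # change in v(s) per unit step
    # Deduce the step size from the intent; skip the update if the
    # direction would not move v(s) the intended way.
    alpha = eta / influence if influence > eps else 0.0
    return w + alpha * delta * direction, z, nu, alpha

rng = np.random.default_rng(0)
eta = 0.05
w, z, nu = np.zeros(5), np.zeros(5), np.zeros(5)
gaps = []
for _ in range(20):
    x = rng.random(5)                  # nonnegative toy features
    v_before = w @ x
    delta = 1.0 - v_before             # error toward a fixed toy target of 1.0
    w, z, nu, alpha = scaled_trace_step(w, z, nu, x, delta, eta=eta)
    gaps.append(abs((w @ x - v_before) - eta * delta))

print(f"largest gap between realized and intended change: {max(gaps):.2e}")
```

For a linear estimate the intent is met up to floating-point error even though the direction is no longer the raw gradient, because the step size is computed from the direction's actual influence on the output.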

This ultimately forms three complete algorithms: Intentional TD(λ) for value prediction, Intentional Q(λ) for discrete action control, and Intentional Policy Gradient for continuous control.

Experimental Results: Matching SAC Even Without GPUs

The paper evaluated this approach on multiple standard benchmarks, with impressive results.

On MuJoCo continuous control tasks (including complex simulated robots like Ant, Humanoid, and HalfCheetah), the new method, Intentional AC, in a streaming setup (batch size = 1, no replay buffer), achieved final performance that repeatedly came close to or even matched SAC, an algorithm that relies on a replay buffer and batch training and is close to the gold standard for continuous control. In terms of computational cost, each Intentional AC update required only about 1/140th of the floating-point operations of a single SAC update.

On Atari and MinAtar discrete-action games, Intentional Q-learning performed comparably to DQN, which uses a replay buffer, and successfully ran all tasks with the same set of hyperparameters, without requiring per-task tuning.

The researchers also specifically verified whether the "intention" was truly realized by measuring the ratio of actual update magnitude to intended update magnitude. In a simplified setting with eligibility traces disabled, the standard deviation of this ratio was only 0.016 to 0.029, and its 99th percentile stayed within 1.07 in every case. In other words, in the vast majority of cases the updates indeed achieved "exactly what they were supposed to do."

Furthermore, an ablation study showed that performance declined somewhat but remained competitive after removing either RMSProp normalization or the σ term, indicating that the "intentional scaling" itself is the primary contributor and the other components are auxiliary.

Problems Remain

The "Intentional Update" framework also demonstrated significant advantages in robustness. When the researchers removed, one by one, the various stabilizing auxiliary techniques (sparse initialization, reward scaling, input normalization, LayerNorm) that the StreamX method relied on, Intentional AC's performance degradation was significantly less than that of the original StreamAC, indicating that intentional scaling reduces reliance on external "crutches" at the root.

However, the paper also candidly addresses a not-yet-fully-resolved issue: in policy learning, the step size depends on the currently sampled action, which implicitly assigns different "weights" to different actions and may alter the expected direction of the policy gradient. In the Humanoid and HumanoidStandup tasks, by measuring the cosine similarity of expected update directions, the researchers found the alignment was close to 0.96 during critical learning phases, so the bias is almost negligible there; in Ant-v4, however, the alignment dropped to a median of 0.63, indicating the problem cannot always be ignored.
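The source of this bias can be illustrated on a toy softmax policy in a single state. The sketch below compares the expected update under a constant step size with the expected update when each action gets its own step size. The per-action step size used here (the inverse squared norm of that action's log-probability gradient) and all numbers are assumptions for illustration, not the paper's measurements.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def expected_update(pi, adv, step_sizes):
    """Expectation over actions of step_size(a) * adv(a) * grad log pi(a)."""
    grads = np.eye(len(pi)) - pi       # row a: gradient of log pi(a) w.r.t. logits
    weights = pi * step_sizes * adv
    return weights @ grads

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pi = softmax(np.array([2.0, 0.0, -1.0]))   # hypothetical policy
q = np.array([1.0, 0.0, 2.0])              # hypothetical action values
adv = q - pi @ q                           # advantages with zero mean under pi

grads = np.eye(3) - pi
sq_norms = np.sum(grads ** 2, axis=1)

reference = expected_update(pi, adv, np.ones(3))            # plain policy gradient
uniform = expected_update(pi, adv, np.full(3, 0.3))         # constant step size
action_dependent = expected_update(pi, adv, 1.0 / sq_norms) # per-action step size

cos_uniform = cosine(reference, uniform)
cos_biased = cosine(reference, action_dependent)
print(f"constant step size:   cosine = {cos_uniform:.3f}")
print(f"per-action step size: cosine = {cos_biased:.3f}")
```

A constant step size only rescales the expected update, so its direction is unchanged. A per-action step size reweights the actions inside the expectation, which rotates the expected direction away from the policy gradient.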

The authors point out that future research should seek step-size selection strategies independent of the action, keeping the "intention" unbiased in expectation as well. This is a clear assignment left for future researchers in this direction.

Conclusion: Enabling AI to Learn Like Humans, On the Job

The current mainstream paradigm for training large models relies on batch digestion of massive data: feeding in all the text and code from the internet, repeatedly iterating until astonishing capabilities emerge. This path has proven effective, but it is fundamentally "learn first, use later": once training is complete, the model is frozen, unable to continuously update from subsequent real-world interactions.

What streaming reinforcement learning pursues is another, completely different learning mode: not relying on massive replay, not relying on huge GPU clusters, converting every single experience immediately into a parameter update, continuously, cheaply, and adaptively. This is closer to how humans and animals actually learn.

From the initial breakthrough of "finally working" by Elsayed et al. in 2024, to the "Intentional Update" principle proposed in this paper, streaming deep reinforcement learning is maturing at a surprisingly rapid pace. It will not replace batch-trained large models, but for applications requiring long-term online adaptation—like robots, edge devices, and any scenario that cannot afford large replay buffers and GPU clusters—this path is becoming increasingly compelling.

The step size is not just a hyperparameter; it is the AI's commitment to "how much it intends to do" with each step. When this commitment finally becomes controllable, learning itself stabilizes.

This article is from the WeChat public account "Almost Human" (ID: almosthuman2014), author: someone interested in RL.
