Turing Award Laureate Sutton's New Work: Using a Formula from 1967 to Solve a Major Flaw in Streaming Reinforcement Learning

marsbit发布于2026-05-10更新于2026-05-10

文章摘要

New research titled "Intentional Updates for Streaming Reinforcement Learning" (arXiv:2604.19033v1), involving Turing Award laureate Richard Sutton, addresses a core challenge in deep reinforcement learning (RL): the "stream barrier." Current deep RL methods typically rely on replay buffers and batch training for stability, failing catastrophically when learning online from single data points (streaming). The authors propose a fundamental shift: instead of prescribing how far to move parameters (a fixed step size), their "Intentional Updates" method specifies the desired change in the function's output (e.g., a 5% reduction in value prediction error). It then calculates the step size needed to achieve that intent. This idea is inspired by the Normalized Least Mean Squares (NLMS) algorithm from 1967. Applied to value and policy learning, this yields algorithms like Intentional TD(λ) and Intentional AC. The method inherently stabilizes learning by adapting the step size based on the local gradient landscape, preventing overshooting/undershooting. In experiments on MuJoCo continuous control and Atari discrete tasks, Intentional AC achieved performance rivaling batch-based algorithms like SAC in a streaming setting (batch size=1, no replay buffer), while being ~140x more computationally efficient per update. The work demonstrates significant robustness, reducing reliance on numerous stabilization tricks. A remaining challenge is bias in policy updates due to action-dependent s...

At the end of 2024, a paper titled "Streaming Deep Reinforcement Learning Finally Works" (arXiv:2410.14606) sparked widespread discussion in the academic community. The authors, from Mahmood's team at the University of Alberta, spent considerable effort describing an embarrassing reality: reinforcement learning, a method that is inherently "learn-as-you-go," has almost become incapable of doing so in the era of deep neural networks. If you simply remove the replay buffer and set the batch size to 1, training collapses. They called this the "stream barrier".

That paper proposed the StreamX series of algorithms, which barely scaled this wall through meticulous tuning of hyperparameters, sparse initialization, and various stabilization techniques.

However, less than a year and a half later, a member of the same research group, along with collaborators from the Openmind Institute, provided a distinctly different answer: the root cause of the stream barrier is not "insufficient data," but "the step size having the wrong unit."

Paper title: Intentional Updates for Streaming Reinforcement Learning

Paper link: https://arxiv.org/pdf/2604.19033v1

Code repository: https://github.com/sharifnassab/Intentional_RL

Stepping on the Gas, How Big a Hole Does It Dig?

Imagine you're learning to parallel park a car. The instructor tells you to "press the gas pedal for 0.1 seconds" each time. The problem is, pressing for the same 0.1 seconds can result in vastly different distances traveled depending on whether you're going uphill, downhill, empty, or fully loaded. Sometimes you're off by a centimeter and park perfectly, other times you're off by 30 centimeters and hit the wall.

Traditional gradient learning step sizes do precisely this: they dictate how much the parameters should move, but exert no control over how much the function's output actually changes. In batch training, the errors of hundreds or thousands of samples are averaged, diluting extreme cases, so the problem isn't obvious. But in a "streaming" environment, where each step involves only one sample, there is no averaging. Once the gradient direction becomes unstable, the magnitude of updates can swing wildly—moving forward 30 cm today, backward 50 cm tomorrow—causing the learning process to collapse amid violent oscillations.

This phenomenon of "overshooting and undershooting" is particularly severe in reinforcement learning because the gradient at each timestep not only varies in magnitude but also changes direction rapidly.

Redefining "How Much a Step Should Do"

In a recent paper, Arsalan Sharifnassab from the Openmind Institute, along with Mohamed Elsayed, A. Rupam Mahmood, and Richard Sutton from the University of Alberta, proposed a solution from a different angle: Instead of specifying how much the parameters should move, directly specify how much the function's output should change.

This idea is not entirely new. In 1967, Japanese scholars Nagumo and Noda, in their paper "A learning method for system identification," proposed the "Normalized Least Mean Squares" (NLMS) algorithm in the field of adaptive filtering; its essence is also using the desired output change to deduce the step size, not the other way around. However, that algorithm was only suitable for simple linear scenarios.

The researchers extended this idea to deep reinforcement learning. They call it "Intentional Updates": before each update, first clarify "what I hope to achieve with this step," then deduce the step size that should be used.

For value learning (i.e., predicting future rewards), their defined intention is: after each update, the prediction error for the current state's value should shrink by a fixed proportion—for example, by 5%, no more, no less. For policy learning (i.e., optimizing decision-making actions), their defined intention is: the probability of selecting the current action is only allowed to change by a "moderate" amount each step.

Using the driving metaphor: this is like the driver deciding before each operation, "I want the car to move forward 20 cm," then automatically calculating how deep to press the gas pedal based on current road conditions (gradient, load), instead of pressing the same depth each time and leaving it to fate.

The Turing Award Laureate and His Puzzle

One of the paper's signatories is Richard S. Sutton—the 2024 Turing Award laureate, widely regarded as the "father of modern reinforcement learning."

Sutton's stature in academia is roughly equivalent to that of Feynman in physics: he not only proposed the Temporal Difference (TD learning) and Policy Gradient frameworks, the foundations of modern reinforcement learning, but also co-authored, with Andrew Barto, the field's most authoritative textbook, "Reinforcement Learning: An Introduction" (now in its second edition, available online for free). He shared the 2024 Turing Award with Barto, with the award citation reading, "for laying the conceptual and algorithmic foundations of reinforcement learning."

After receiving the award, Sutton did not retire but instead invested the prize money into the Openmind Institute he founded, specifically funding young researchers willing to "explore fundamental problems in an environment free from commercial pressure." This new paper emerged from this non-profit institution.

And the paper's first author, Sharifnassab, had recently published the MetaOptimize framework at ICML 2025, researching how to automatically tune learning rates online. The focus of both topics is highly consistent: how to make the step size itself more intelligent.

Algorithm Details: Simpler Than Imagined

The mathematical derivation of "Intentional Updates" is not complex; its core formula can be described in one sentence: the step size equals the "desired output change" divided by the "actual influence of the gradient direction on the output."

In value learning, this "actual influence" is the norm of the gradient vector (essentially measuring how "steep" the current parameter region is): step sizes are smaller in steeper areas and larger in flatter areas, ensuring the impact of each update on the value function remains consistent.

In policy learning, the "desired change" is defined to be proportional to the advantage function: how much better the current action is compared to the average determines how much the policy moves in that direction—normalized in magnitude through a running average, ensuring that over the long term, the magnitude of policy changes remains stable within an interpretable range.

The researchers also combined this core idea with two engineering practices: RMSProp-style diagonal scaling (handling differences in magnitude across parameter dimensions) and eligibility traces (helping reward signals propagate to past timesteps).

This ultimately forms three complete algorithms: Intentional TD (λ) for value prediction, Intentional Q (λ) for discrete action control, and Intentional Policy Gradient for continuous control.

Experimental Results: Matching SAC Even Without GPUs

The paper evaluated this approach on multiple standard benchmarks, with impressive results.

On MuJoCo continuous control tasks (including complex simulated robots like Ant, Humanoid, HalfCheetah), the new method, Intentional AC, in a streaming setup (batch size = 1, no replay buffer), achieved final performance that repeatedly came close to or even matched SAC—an algorithm that uses large-batch replay buffers and is almost the gold standard for current continuous control tasks. In terms of computational cost, each Intentional AC update required only about 1/140th of the floating-point operations of a single SAC update.

On Atari and MinAtar discrete-action games, Intentional Q-learning performed comparably to DQN, which uses a replay buffer, and successfully ran all tasks with the same set of hyperparameters, without requiring per-task tuning.

The researchers also specifically verified whether the "intention" was truly realized: they measured the ratio of actual update magnitude to intended update magnitude. In a simplified setting with eligibility traces disabled, the standard deviation of this ratio was only 0.016 to 0.029, with the 99th percentile all within 1.07; meaning that in the vast majority of cases, the updates indeed achieved "exactly what they were supposed to do."

Furthermore, an ablation study showed that performance declined somewhat but remained competitive after removing RMSProp normalization or the σ term, with this "intentional scaling" itself being the primary contributor, while other components were auxiliary.

Problems Remain

The "Intentional Update" framework also demonstrated significant advantages in robustness. When the researchers removed, one by one, the various stabilizing auxiliary techniques (sparse initialization, reward scaling, input normalization, LayerNorm) that the StreamX method relied on, Intentional AC's performance degradation was significantly less than that of the original StreamAC, indicating that intentional scaling reduces reliance on external "crutches" at the root.

However, the paper also candidly addresses a not-yet-fully-resolved issue: in policy learning, the step size depends on the currently sampled action, which implicitly assigns different "weights" to different actions and may alter the expected direction of the policy gradient. In Humanoid and HumanoidStandup tasks, by measuring the cosine similarity of expected update directions, the researchers found this bias was close to 0.96 (almost negligible) during critical learning phases; but in Ant-v4, the alignment dropped to a median of 0.63, indicating the problem cannot always be ignored.

The authors point out that future research should seek step-size selection strategies independent of the action, keeping the "intention" unbiased in expectation as well. This is a clear assignment left for future researchers in this direction.

Conclusion: Enabling AI to Learn Like Humans, On the Job

The current mainstream paradigm for training large models relies on batch digestion of massive data: feeding in all the text and code from the internet, repeatedly iterating until astonishing capabilities emerge. This path has proven effective, but it is fundamentally "learn first, use later": once training is complete, the model is frozen, unable to continuously update from subsequent real-world interactions.

What streaming reinforcement learning pursues is another, completely different learning mode: not relying on massive replay, not relying on huge GPU clusters, converting every single experience immediately into a parameter update, continuously, cheaply, and adaptively. This is closer to how humans and animals actually learn.

From the initial breakthrough of "finally working" by Elsayed et al. in 2024, to the "Intentional Update" principle proposed in this paper, streaming deep reinforcement learning is maturing at a surprisingly rapid pace. It will not replace batch-trained large models, but for applications requiring long-term online adaptation—like robots, edge devices, and any scenario that cannot afford large replay buffers and GPU clusters—this path is becoming increasingly compelling.

The step size is not just a hyperparameter; it is the AI's commitment to "how much it intends to do" with each step. When this commitment finally becomes controllable, learning itself stabilizes.

This article is from the WeChat public account "Almost Human" (ID: almosthuman2014), author: someone interested in RL.

相关问答

QWhat is the 'stream barrier' problem described in the article?

AThe 'stream barrier' refers to a major difficulty in deep reinforcement learning where the training process collapses when using a streaming setup—meaning no replay buffer and a batch size of one. This prevents the agent from learning effectively from individual, real-time experiences, which is a fundamental characteristic reinforcement learning should possess.

QWhat is the core principle behind the 'Intentional Updates' method proposed in the paper?

AThe core principle of 'Intentional Updates' is to specify how much the function's output (e.g., a value prediction) should change after a parameter update, rather than specifying how much the parameters themselves should move. It inverts the traditional approach by using the desired output change to determine the appropriate step size for the update, leading to more stable learning in a streaming environment.

QHow does the Intentional Updates method relate to historical work from 1967?

AThe idea is conceptually linked to the 1967 Normalized Least Mean Squares (NLMS) algorithm by Nagumo and Noda, which used the expected output change to determine the step size for adaptive filtering. The new paper generalizes this core idea from simple linear settings to the complex, non-linear function approximation context of deep reinforcement learning.

QWhat are some key performance results of the Intentional AC algorithm mentioned in the article?

AIn MuJoCo continuous control tasks with a strict streaming setup (batch size=1, no replay buffer), the Intentional AC algorithm achieved final performance close to or on par with SAC, a state-of-the-art method that uses large batch replay buffers. Furthermore, each Intentional AC update required about 1/140th the floating-point operations (FLOPS) of a single SAC update.

QWhat is a limitation or open problem acknowledged for the Intentional Updates method, particularly in policy learning?

AIn policy learning, the step size depends on the currently sampled action. This can implicitly assign different weights to different actions, potentially biasing the expected direction of the policy gradient. The paper notes that while this bias is negligible in some tasks, it can be more significant in others (e.g., Ant-v4), indicating a need for future research into action-independent step size selection strategies.

你可能也喜欢

全球发卡进入合规驱动时代:WasabiCard 正在构建下一代支付基础设施

近日,WasabiCard发布行业观点指出,随着稳定币在跨境支付、企业结算等场景中的应用深化,全球发卡行业正从“增长驱动”转向“合规驱动”阶段。未来竞争重点将从“是否可用”转向“能否在全球合规框架下长期稳定运行”。 WasabiCard联合创始人Ray表示,稳定币正成为全球商业的重要支付媒介,下一阶段的竞争核心在于企业是否具备长期合规运营与全球化基础设施能力。通过Visa、Mastercard等支付网络,稳定币得以进入日常消费、企业支付与跨境资金流转场景。 行业快速扩张也暴露出跨区域发卡、BIN资源管理、匿名发卡及反洗钱与风控能力不足等问题。因此,行业正从依赖“灰色效率”的增长模式,转向更重视合规、风控与长期运营能力的发展路径。 为此,WasabiCard披露了其基础设施策略,包括:与持牌主会员及本地合作伙伴协作建立本地化运营体系、构建完善的KYC与AML风控体系、严格区分商业与消费BIN使用场景,并持续完善全球发卡、支付与跨境资金流转基础设施。其目标是在全球监管演进背景下,构建具备长期稳定性与可扩展性的支付基础设施。 未来,全球发卡行业的竞争逻辑将从“规模竞争”转向“基础设施能力竞争”。WasabiCard将继续围绕全球发卡、稳定币支付、跨境资金流转及API驱动金融工作流等方向,完善基础设施能力,推动稳定币在真实商业世界中的规模化应用。

marsbit34分钟前

全球发卡进入合规驱动时代:WasabiCard 正在构建下一代支付基础设施

marsbit34分钟前

周航:SpaceX 究竟值多少钱?

SpaceX被广泛认为是过去50年最伟大的工业公司之一,其颠覆了国家垄断的航天产业,将发射成本大幅降低并实现了火箭可重复使用,主导了全球商业发射市场并建立了Starlink卫星互联网。然而,作者周航指出,其当前约1.75万亿美元的IPO估值可能被高估了约1.25万亿美元。 通过分析SpaceX各项业务到2030年的乐观收入预期(约500-800亿美元)及盈利,并采用科技公司较高的估值倍数,其“合理”估值区间应在5000亿至1.2万亿美元。当前1.75万亿美元的溢价主要源于三方面:对未来愿景(如Starship大幅降低成本、开启太空新业务)的期望、其作为美国战略资产的主权地位溢价,以及散户对埃隆·马斯克个人及其宏大叙事的追捧。 文章预测IPO后可能出现的三种情景:愿景快速兑现推高估值(概率25%)、进展不及预期导致估值高位震荡(概率50%)、或风险事件引发情绪溢价收缩、估值向合理区间回落(概率25%)。概率加权后的期望估值中枢约为1.3-1.5万亿美元,低于IPO申报价。因此,作者认为在IPO当天以1.75万亿美元的价格买入,从未来3-5年的预期收益看并不有利。 核心结论是:SpaceX是一家伟大的公司,但伟大的公司不等于在任何价格都值得投资的股票。投资者需要分清自己购买的是公司的长期价值,还是为“参与历史”的叙事支付溢价。建议理性投资者避免在IPO首日追高,可等待关键里程碑实现或价格回调后再考虑。

链捕手38分钟前

周航:SpaceX 究竟值多少钱?

链捕手38分钟前

交易

现货
合约
活动图片