Turing Award Laureate Sutton's New Work: Using a Formula from 1967 to Solve a Major Flaw in Streaming Reinforcement Learning

marsbit发布于2026-05-10更新于2026-05-10

文章摘要

New research titled "Intentional Updates for Streaming Reinforcement Learning" (arXiv:2604.19033v1), involving Turing Award laureate Richard Sutton, addresses a core challenge in deep reinforcement learning (RL): the "stream barrier." Current deep RL methods typically rely on replay buffers and batch training for stability, failing catastrophically when learning online from single data points (streaming). The authors propose a fundamental shift: instead of prescribing how far to move parameters (a fixed step size), their "Intentional Updates" method specifies the desired change in the function's output (e.g., a 5% reduction in value prediction error). It then calculates the step size needed to achieve that intent. This idea is inspired by the Normalized Least Mean Squares (NLMS) algorithm from 1967. Applied to value and policy learning, this yields algorithms like Intentional TD(λ) and Intentional AC. The method inherently stabilizes learning by adapting the step size based on the local gradient landscape, preventing overshooting/undershooting. In experiments on MuJoCo continuous control and Atari discrete tasks, Intentional AC achieved performance rivaling batch-based algorithms like SAC in a streaming setting (batch size=1, no replay buffer), while being ~140x more computationally efficient per update. The work demonstrates significant robustness, reducing reliance on numerous stabilization tricks. A remaining challenge is bias in policy updates due to action-dependent s...

At the end of 2024, a paper titled "Streaming Deep Reinforcement Learning Finally Works" (arXiv:2410.14606) sparked widespread discussion in the academic community. The authors, from Mahmood's team at the University of Alberta, spent considerable effort describing an embarrassing reality: reinforcement learning, a method that is inherently "learn-as-you-go," has almost become incapable of doing so in the era of deep neural networks. If you simply remove the replay buffer and set the batch size to 1, training collapses. They called this the "stream barrier".

That paper proposed the StreamX series of algorithms, which barely scaled this wall through meticulous tuning of hyperparameters, sparse initialization, and various stabilization techniques.

However, less than a year and a half later, a member of the same research group, along with collaborators from the Openmind Institute, provided a distinctly different answer: the root cause of the stream barrier is not "insufficient data," but "the step size having the wrong unit."

Paper title: Intentional Updates for Streaming Reinforcement Learning

Paper link: https://arxiv.org/pdf/2604.19033v1

Code repository: https://github.com/sharifnassab/Intentional_RL

Stepping on the Gas, How Big a Hole Does It Dig?

Imagine you're learning to parallel park a car. The instructor tells you to "press the gas pedal for 0.1 seconds" each time. The problem is, pressing for the same 0.1 seconds can result in vastly different distances traveled depending on whether you're going uphill, downhill, empty, or fully loaded. Sometimes you're off by a centimeter and park perfectly, other times you're off by 30 centimeters and hit the wall.

Traditional gradient learning step sizes do precisely this: they dictate how much the parameters should move, but exert no control over how much the function's output actually changes. In batch training, the errors of hundreds or thousands of samples are averaged, diluting extreme cases, so the problem isn't obvious. But in a "streaming" environment, where each step involves only one sample, there is no averaging. Once the gradient direction becomes unstable, the magnitude of updates can swing wildly—moving forward 30 cm today, backward 50 cm tomorrow—causing the learning process to collapse amid violent oscillations.

This phenomenon of "overshooting and undershooting" is particularly severe in reinforcement learning because the gradient at each timestep not only varies in magnitude but also changes direction rapidly.

Redefining "How Much a Step Should Do"

In a recent paper, Arsalan Sharifnassab from the Openmind Institute, along with Mohamed Elsayed, A. Rupam Mahmood, and Richard Sutton from the University of Alberta, proposed a solution from a different angle: Instead of specifying how much the parameters should move, directly specify how much the function's output should change.

This idea is not entirely new. In 1967, Japanese scholars Nagumo and Noda, in their paper "A learning method for system identification," proposed the "Normalized Least Mean Squares" (NLMS) algorithm in the field of adaptive filtering; its essence is also using the desired output change to deduce the step size, not the other way around. However, that algorithm was only suitable for simple linear scenarios.

The researchers extended this idea to deep reinforcement learning. They call it "Intentional Updates": before each update, first clarify "what I hope to achieve with this step," then deduce the step size that should be used.

For value learning (i.e., predicting future rewards), their defined intention is: after each update, the prediction error for the current state's value should shrink by a fixed proportion—for example, by 5%, no more, no less. For policy learning (i.e., optimizing decision-making actions), their defined intention is: the probability of selecting the current action is only allowed to change by a "moderate" amount each step.

Using the driving metaphor: this is like the driver deciding before each operation, "I want the car to move forward 20 cm," then automatically calculating how deep to press the gas pedal based on current road conditions (gradient, load), instead of pressing the same depth each time and leaving it to fate.

The Turing Award Laureate and His Puzzle

One of the paper's signatories is Richard S. Sutton—the 2024 Turing Award laureate, widely regarded as the "father of modern reinforcement learning."

Sutton's stature in academia is roughly equivalent to that of Feynman in physics: he not only proposed the Temporal Difference (TD learning) and Policy Gradient frameworks, the foundations of modern reinforcement learning, but also co-authored, with Andrew Barto, the field's most authoritative textbook, "Reinforcement Learning: An Introduction" (now in its second edition, available online for free). He shared the 2024 Turing Award with Barto, with the award citation reading, "for laying the conceptual and algorithmic foundations of reinforcement learning."

After receiving the award, Sutton did not retire but instead invested the prize money into the Openmind Institute he founded, specifically funding young researchers willing to "explore fundamental problems in an environment free from commercial pressure." This new paper emerged from this non-profit institution.

And the paper's first author, Sharifnassab, had recently published the MetaOptimize framework at ICML 2025, researching how to automatically tune learning rates online. The focus of both topics is highly consistent: how to make the step size itself more intelligent.

Algorithm Details: Simpler Than Imagined

The mathematical derivation of "Intentional Updates" is not complex; its core formula can be described in one sentence: the step size equals the "desired output change" divided by the "actual influence of the gradient direction on the output."

In value learning, this "actual influence" is the norm of the gradient vector (essentially measuring how "steep" the current parameter region is): step sizes are smaller in steeper areas and larger in flatter areas, ensuring the impact of each update on the value function remains consistent.

In policy learning, the "desired change" is defined to be proportional to the advantage function: how much better the current action is compared to the average determines how much the policy moves in that direction—normalized in magnitude through a running average, ensuring that over the long term, the magnitude of policy changes remains stable within an interpretable range.

The researchers also combined this core idea with two engineering practices: RMSProp-style diagonal scaling (handling differences in magnitude across parameter dimensions) and eligibility traces (helping reward signals propagate to past timesteps).

This ultimately forms three complete algorithms: Intentional TD (λ) for value prediction, Intentional Q (λ) for discrete action control, and Intentional Policy Gradient for continuous control.

Experimental Results: Matching SAC Even Without GPUs

The paper evaluated this approach on multiple standard benchmarks, with impressive results.

On MuJoCo continuous control tasks (including complex simulated robots like Ant, Humanoid, HalfCheetah), the new method, Intentional AC, in a streaming setup (batch size = 1, no replay buffer), achieved final performance that repeatedly came close to or even matched SAC—an algorithm that uses large-batch replay buffers and is almost the gold standard for current continuous control tasks. In terms of computational cost, each Intentional AC update required only about 1/140th of the floating-point operations of a single SAC update.

On Atari and MinAtar discrete-action games, Intentional Q-learning performed comparably to DQN, which uses a replay buffer, and successfully ran all tasks with the same set of hyperparameters, without requiring per-task tuning.

The researchers also specifically verified whether the "intention" was truly realized: they measured the ratio of actual update magnitude to intended update magnitude. In a simplified setting with eligibility traces disabled, the standard deviation of this ratio was only 0.016 to 0.029, with the 99th percentile all within 1.07; meaning that in the vast majority of cases, the updates indeed achieved "exactly what they were supposed to do."

Furthermore, an ablation study showed that performance declined somewhat but remained competitive after removing RMSProp normalization or the σ term, with this "intentional scaling" itself being the primary contributor, while other components were auxiliary.

Problems Remain

The "Intentional Update" framework also demonstrated significant advantages in robustness. When the researchers removed, one by one, the various stabilizing auxiliary techniques (sparse initialization, reward scaling, input normalization, LayerNorm) that the StreamX method relied on, Intentional AC's performance degradation was significantly less than that of the original StreamAC, indicating that intentional scaling reduces reliance on external "crutches" at the root.

However, the paper also candidly addresses a not-yet-fully-resolved issue: in policy learning, the step size depends on the currently sampled action, which implicitly assigns different "weights" to different actions and may alter the expected direction of the policy gradient. In Humanoid and HumanoidStandup tasks, by measuring the cosine similarity of expected update directions, the researchers found this bias was close to 0.96 (almost negligible) during critical learning phases; but in Ant-v4, the alignment dropped to a median of 0.63, indicating the problem cannot always be ignored.

The authors point out that future research should seek step-size selection strategies independent of the action, keeping the "intention" unbiased in expectation as well. This is a clear assignment left for future researchers in this direction.

Conclusion: Enabling AI to Learn Like Humans, On the Job

The current mainstream paradigm for training large models relies on batch digestion of massive data: feeding in all the text and code from the internet, repeatedly iterating until astonishing capabilities emerge. This path has proven effective, but it is fundamentally "learn first, use later": once training is complete, the model is frozen, unable to continuously update from subsequent real-world interactions.

What streaming reinforcement learning pursues is another, completely different learning mode: not relying on massive replay, not relying on huge GPU clusters, converting every single experience immediately into a parameter update, continuously, cheaply, and adaptively. This is closer to how humans and animals actually learn.

From the initial breakthrough of "finally working" by Elsayed et al. in 2024, to the "Intentional Update" principle proposed in this paper, streaming deep reinforcement learning is maturing at a surprisingly rapid pace. It will not replace batch-trained large models, but for applications requiring long-term online adaptation—like robots, edge devices, and any scenario that cannot afford large replay buffers and GPU clusters—this path is becoming increasingly compelling.

The step size is not just a hyperparameter; it is the AI's commitment to "how much it intends to do" with each step. When this commitment finally becomes controllable, learning itself stabilizes.

This article is from the WeChat public account "Almost Human" (ID: almosthuman2014), author: someone interested in RL.

相关问答

QWhat is the 'stream barrier' problem described in the article?

AThe 'stream barrier' refers to a major difficulty in deep reinforcement learning where the training process collapses when using a streaming setup—meaning no replay buffer and a batch size of one. This prevents the agent from learning effectively from individual, real-time experiences, which is a fundamental characteristic reinforcement learning should possess.

QWhat is the core principle behind the 'Intentional Updates' method proposed in the paper?

AThe core principle of 'Intentional Updates' is to specify how much the function's output (e.g., a value prediction) should change after a parameter update, rather than specifying how much the parameters themselves should move. It inverts the traditional approach by using the desired output change to determine the appropriate step size for the update, leading to more stable learning in a streaming environment.

QHow does the Intentional Updates method relate to historical work from 1967?

AThe idea is conceptually linked to the 1967 Normalized Least Mean Squares (NLMS) algorithm by Nagumo and Noda, which used the expected output change to determine the step size for adaptive filtering. The new paper generalizes this core idea from simple linear settings to the complex, non-linear function approximation context of deep reinforcement learning.

QWhat are some key performance results of the Intentional AC algorithm mentioned in the article?

AIn MuJoCo continuous control tasks with a strict streaming setup (batch size=1, no replay buffer), the Intentional AC algorithm achieved final performance close to or on par with SAC, a state-of-the-art method that uses large batch replay buffers. Furthermore, each Intentional AC update required about 1/140th the floating-point operations (FLOPS) of a single SAC update.

QWhat is a limitation or open problem acknowledged for the Intentional Updates method, particularly in policy learning?

AIn policy learning, the step size depends on the currently sampled action. This can implicitly assign different weights to different actions, potentially biasing the expected direction of the policy gradient. The paper notes that while this bias is negligible in some tasks, it can be more significant in others (e.g., Ant-v4), indicating a need for future research into action-independent step size selection strategies.

你可能也喜欢

AI PC大战:不要押阵营,要押收费站

英伟达与联发科切入AI PC,标志着Windows端侧AI生态进入多玩家竞争阶段。作者认为,不应简单将其视为“x86对Arm”的阵营之争,而应关注谁能持续获取利润与产业链定价权。 AI PC的投资机会可分为三层:一是先进制程“收费站”,无论哪方胜出,台积电(TSMC)都将受益;二是算力与平台外溢,以AMD(x86进攻)和英伟达(GPU软件栈延伸)为代表;三是架构扩散和困境反转,Arm和英特尔(INTC)具备弹性但需谨慎。 行业已从概念进入出货验证期。尽管短期出货预测有所下调,但AI PC长期标配化趋势不变。投资难点在于用户换机意愿,若企业端广泛部署隐私计算等应用,将推动市场从消费电子转向企业IT更新。 竞争格局上,各芯片厂商优势各异,但高端芯片均依赖先进制程。台积电在晶圆代工市场占据超70%份额,成为AI硬件时代的确定性受益者。 投资策略上,作者建议分层配置:将台积电视为底仓(确定性现金流),AMD作为进攻性选择,Arm和英特尔则用于捕捉弹性机会。核心逻辑是投资“收费站”和平台,而非押注单一架构。 风险包括:AI PC应用不及预期、Windows on Arm兼容性改善缓慢、关税与宏观因素影响需求、先进制程供需错配,以及整体AI估值偏高可能引发的回调。因此,应将AI PC视为长期产业趋势,在情绪退潮后布局生态与现金流稳定的公司。

marsbit10分钟前

AI PC大战:不要押阵营,要押收费站

marsbit10分钟前

万字解析:从10美元到290美元,MRVL靠「不做GPU」赢了整个AI时代

Marvell Technology(MRVL)股价从2016年不到10美元涨至2026年的290美元,涨幅达30倍,核心在于其独特定位:不做GPU,而是专注于AI时代的“连接”基础设施。 公司业务分为三块:一是光互连(光DSP),在400G以上数据中心光模块市场占约70%份额,技术护城河深;二是定制AI芯片,为Amazon等云巨头设计XPU,拥有18个项目、750亿美元潜在收入;三是以太网交换芯片与企业存储,提供稳定现金流。 CEO Matt Murphy上任后大幅改革,砍掉非核心业务,收购Inphi(光DSP)、Cavium、Celestial AI(光子织网)等公司,聚焦数据中心,并绑定大客户获得长期订单。 英伟达投资20亿美元战略入股,认可Marvell在AI互连生态的价值。市场常将Marvell视为“小Broadcom”,但两者本质不同:Marvell在光DSP是领导者,而定制芯片业务虽毛利率较低,但随规模扩大有望改善。 主要风险包括:丢失Amazon Trainium3订单、客户集中度高、毛利率天花板、英伟达既是伙伴也是潜在竞争者、内部人士减持及供应链产能压力。但公司光互连技术优势显著,结合PEG约0.6的估值,仍有增长空间。 本质上,Marvell抓住了AI基础设施从“堆算力”转向“建系统”的趋势。在AI集群规模不断扩大、数据流动需求激增的背景下,“连接”的价值日益凸显,而Marvell正处在这一核心位置。

marsbit40分钟前

万字解析:从10美元到290美元,MRVL靠「不做GPU」赢了整个AI时代

marsbit40分钟前

AI中转站引发知乎热议:便宜Token背后,用户真正担心什么?

知乎上关于“AI中转站与便宜Token”的讨论引发广泛关注,焦点从单纯的工具选择转向了深层的成本与信任问题。 用户首要担忧的是模型真实性。AI中转站被类比为“AI版黄牛”,技术门槛不高,但上游来源常不透明,存在“模型掉包”风险。由于大模型输出具有随机性,普通用户难以辨别自己是否真的在使用所付费的旗舰模型,这本质上是一种信息不对称交易。 其次,便宜Token的性价比需要理性看待。其“低价感”常源于与官方API按量价的对比,若与官方订阅套餐、国产模型或免费额度相比,未必总是最优。讨论强调用户应先明确自身需求——是偶尔使用还是高频调用,再选择合适渠道。 便宜Token的来源复杂,既可能有批量采购、缓存优化等合法路径,也可能涉及订阅拆分、地区价差套利甚至更灰色的渠道。这种混合供给导致服务稳定性和余额风险难以评估。真正的成本计算需涵盖模型真实性、服务稳定性和数据安全。 数据安全成为核心关切,尤其在AI编程、Agent和企业应用场景中。经过中转站的prompt、代码、业务文档和密钥可能面临泄露风险。对于企业,这还涉及商业秘密、数据合规与供应商审查等治理问题。 讨论形成的普遍共识是:AI中转站可用于低敏感、可替代的任务(如公开资料总结、简单测试),但不建议作为默认入口,尤其不能用于处理敏感数据或接入生产环境。实用建议包括:避免大额充值、分散风险、定期测试模型、做好数据脱敏。 这场讨论揭示,当AI能力按Token计价时,用户为节省调用费用,可能潜在地牺牲了信任与安全。随着AI更深度融入工作流,明晰请求路径、模型来源与数据流向变得至关重要。

marsbit1小时前

AI中转站引发知乎热议:便宜Token背后,用户真正担心什么?

marsbit1小时前

交易

现货
合约
活动图片