Turing Award Laureate Sutton's New Work: Using a Formula from 1967 to Solve a Major Flaw in Streaming Reinforcement Learning

marsbitPublished on 2026-05-10Last updated on 2026-05-10

Abstract

New research titled "Intentional Updates for Streaming Reinforcement Learning" (arXiv:2604.19033v1), involving Turing Award laureate Richard Sutton, addresses a core challenge in deep reinforcement learning (RL): the "stream barrier." Current deep RL methods typically rely on replay buffers and batch training for stability, failing catastrophically when learning online from single data points (streaming). The authors propose a fundamental shift: instead of prescribing how far to move parameters (a fixed step size), their "Intentional Updates" method specifies the desired change in the function's output (e.g., a 5% reduction in value prediction error). It then calculates the step size needed to achieve that intent. This idea is inspired by the Normalized Least Mean Squares (NLMS) algorithm from 1967. Applied to value and policy learning, this yields algorithms like Intentional TD(λ) and Intentional AC. The method inherently stabilizes learning by adapting the step size based on the local gradient landscape, preventing overshooting/undershooting. In experiments on MuJoCo continuous control and Atari discrete tasks, Intentional AC achieved performance rivaling batch-based algorithms like SAC in a streaming setting (batch size=1, no replay buffer), while being ~140x more computationally efficient per update. The work demonstrates significant robustness, reducing reliance on numerous stabilization tricks. A remaining challenge is bias in policy updates due to action-dependent s...

At the end of 2024, a paper titled "Streaming Deep Reinforcement Learning Finally Works" (arXiv:2410.14606) sparked widespread discussion in the academic community. The authors, from Mahmood's team at the University of Alberta, spent considerable effort describing an embarrassing reality: reinforcement learning, a method that is inherently "learn-as-you-go," has almost become incapable of doing so in the era of deep neural networks. If you simply remove the replay buffer and set the batch size to 1, training collapses. They called this the "stream barrier".

That paper proposed the StreamX series of algorithms, which barely scaled this wall through meticulous tuning of hyperparameters, sparse initialization, and various stabilization techniques.

However, less than a year and a half later, a member of the same research group, along with collaborators from the Openmind Institute, provided a distinctly different answer: the root cause of the stream barrier is not "insufficient data," but "the step size having the wrong unit."

Paper title: Intentional Updates for Streaming Reinforcement Learning

Paper link: https://arxiv.org/pdf/2604.19033v1

Code repository: https://github.com/sharifnassab/Intentional_RL

Stepping on the Gas, How Big a Hole Does It Dig?

Imagine you're learning to parallel park a car. The instructor tells you to "press the gas pedal for 0.1 seconds" each time. The problem is, pressing for the same 0.1 seconds can result in vastly different distances traveled depending on whether you're going uphill, downhill, empty, or fully loaded. Sometimes you're off by a centimeter and park perfectly, other times you're off by 30 centimeters and hit the wall.

Traditional gradient learning step sizes do precisely this: they dictate how much the parameters should move, but exert no control over how much the function's output actually changes. In batch training, the errors of hundreds or thousands of samples are averaged, diluting extreme cases, so the problem isn't obvious. But in a "streaming" environment, where each step involves only one sample, there is no averaging. Once the gradient direction becomes unstable, the magnitude of updates can swing wildly—moving forward 30 cm today, backward 50 cm tomorrow—causing the learning process to collapse amid violent oscillations.

This phenomenon of "overshooting and undershooting" is particularly severe in reinforcement learning because the gradient at each timestep not only varies in magnitude but also changes direction rapidly.

Redefining "How Much a Step Should Do"

In a recent paper, Arsalan Sharifnassab from the Openmind Institute, along with Mohamed Elsayed, A. Rupam Mahmood, and Richard Sutton from the University of Alberta, proposed a solution from a different angle: Instead of specifying how much the parameters should move, directly specify how much the function's output should change.

This idea is not entirely new. In 1967, Japanese scholars Nagumo and Noda, in their paper "A learning method for system identification," proposed the "Normalized Least Mean Squares" (NLMS) algorithm in the field of adaptive filtering; its essence is also using the desired output change to deduce the step size, not the other way around. However, that algorithm was only suitable for simple linear scenarios.

The researchers extended this idea to deep reinforcement learning. They call it "Intentional Updates": before each update, first clarify "what I hope to achieve with this step," then deduce the step size that should be used.

For value learning (i.e., predicting future rewards), their defined intention is: after each update, the prediction error for the current state's value should shrink by a fixed proportion—for example, by 5%, no more, no less. For policy learning (i.e., optimizing decision-making actions), their defined intention is: the probability of selecting the current action is only allowed to change by a "moderate" amount each step.

Using the driving metaphor: this is like the driver deciding before each operation, "I want the car to move forward 20 cm," then automatically calculating how deep to press the gas pedal based on current road conditions (gradient, load), instead of pressing the same depth each time and leaving it to fate.

The Turing Award Laureate and His Puzzle

One of the paper's signatories is Richard S. Sutton—the 2024 Turing Award laureate, widely regarded as the "father of modern reinforcement learning."

Sutton's stature in academia is roughly equivalent to that of Feynman in physics: he not only proposed the Temporal Difference (TD learning) and Policy Gradient frameworks, the foundations of modern reinforcement learning, but also co-authored, with Andrew Barto, the field's most authoritative textbook, "Reinforcement Learning: An Introduction" (now in its second edition, available online for free). He shared the 2024 Turing Award with Barto, with the award citation reading, "for laying the conceptual and algorithmic foundations of reinforcement learning."

After receiving the award, Sutton did not retire but instead invested the prize money into the Openmind Institute he founded, specifically funding young researchers willing to "explore fundamental problems in an environment free from commercial pressure." This new paper emerged from this non-profit institution.

And the paper's first author, Sharifnassab, had recently published the MetaOptimize framework at ICML 2025, researching how to automatically tune learning rates online. The focus of both topics is highly consistent: how to make the step size itself more intelligent.

Algorithm Details: Simpler Than Imagined

The mathematical derivation of "Intentional Updates" is not complex; its core formula can be described in one sentence: the step size equals the "desired output change" divided by the "actual influence of the gradient direction on the output."

In value learning, this "actual influence" is the norm of the gradient vector (essentially measuring how "steep" the current parameter region is): step sizes are smaller in steeper areas and larger in flatter areas, ensuring the impact of each update on the value function remains consistent.

In policy learning, the "desired change" is defined to be proportional to the advantage function: how much better the current action is compared to the average determines how much the policy moves in that direction—normalized in magnitude through a running average, ensuring that over the long term, the magnitude of policy changes remains stable within an interpretable range.

The researchers also combined this core idea with two engineering practices: RMSProp-style diagonal scaling (handling differences in magnitude across parameter dimensions) and eligibility traces (helping reward signals propagate to past timesteps).

This ultimately forms three complete algorithms: Intentional TD (λ) for value prediction, Intentional Q (λ) for discrete action control, and Intentional Policy Gradient for continuous control.

Experimental Results: Matching SAC Even Without GPUs

The paper evaluated this approach on multiple standard benchmarks, with impressive results.

On MuJoCo continuous control tasks (including complex simulated robots like Ant, Humanoid, HalfCheetah), the new method, Intentional AC, in a streaming setup (batch size = 1, no replay buffer), achieved final performance that repeatedly came close to or even matched SAC—an algorithm that uses large-batch replay buffers and is almost the gold standard for current continuous control tasks. In terms of computational cost, each Intentional AC update required only about 1/140th of the floating-point operations of a single SAC update.

On Atari and MinAtar discrete-action games, Intentional Q-learning performed comparably to DQN, which uses a replay buffer, and successfully ran all tasks with the same set of hyperparameters, without requiring per-task tuning.

The researchers also specifically verified whether the "intention" was truly realized: they measured the ratio of actual update magnitude to intended update magnitude. In a simplified setting with eligibility traces disabled, the standard deviation of this ratio was only 0.016 to 0.029, with the 99th percentile all within 1.07; meaning that in the vast majority of cases, the updates indeed achieved "exactly what they were supposed to do."

Furthermore, an ablation study showed that performance declined somewhat but remained competitive after removing RMSProp normalization or the σ term, with this "intentional scaling" itself being the primary contributor, while other components were auxiliary.

Problems Remain

The "Intentional Update" framework also demonstrated significant advantages in robustness. When the researchers removed, one by one, the various stabilizing auxiliary techniques (sparse initialization, reward scaling, input normalization, LayerNorm) that the StreamX method relied on, Intentional AC's performance degradation was significantly less than that of the original StreamAC, indicating that intentional scaling reduces reliance on external "crutches" at the root.

However, the paper also candidly addresses a not-yet-fully-resolved issue: in policy learning, the step size depends on the currently sampled action, which implicitly assigns different "weights" to different actions and may alter the expected direction of the policy gradient. In Humanoid and HumanoidStandup tasks, by measuring the cosine similarity of expected update directions, the researchers found this bias was close to 0.96 (almost negligible) during critical learning phases; but in Ant-v4, the alignment dropped to a median of 0.63, indicating the problem cannot always be ignored.

The authors point out that future research should seek step-size selection strategies independent of the action, keeping the "intention" unbiased in expectation as well. This is a clear assignment left for future researchers in this direction.

Conclusion: Enabling AI to Learn Like Humans, On the Job

The current mainstream paradigm for training large models relies on batch digestion of massive data: feeding in all the text and code from the internet, repeatedly iterating until astonishing capabilities emerge. This path has proven effective, but it is fundamentally "learn first, use later": once training is complete, the model is frozen, unable to continuously update from subsequent real-world interactions.

What streaming reinforcement learning pursues is another, completely different learning mode: not relying on massive replay, not relying on huge GPU clusters, converting every single experience immediately into a parameter update, continuously, cheaply, and adaptively. This is closer to how humans and animals actually learn.

From the initial breakthrough of "finally working" by Elsayed et al. in 2024, to the "Intentional Update" principle proposed in this paper, streaming deep reinforcement learning is maturing at a surprisingly rapid pace. It will not replace batch-trained large models, but for applications requiring long-term online adaptation—like robots, edge devices, and any scenario that cannot afford large replay buffers and GPU clusters—this path is becoming increasingly compelling.

The step size is not just a hyperparameter; it is the AI's commitment to "how much it intends to do" with each step. When this commitment finally becomes controllable, learning itself stabilizes.

This article is from the WeChat public account "Almost Human" (ID: almosthuman2014), author: someone interested in RL.

Related Questions

QWhat is the 'stream barrier' problem described in the article?

AThe 'stream barrier' refers to a major difficulty in deep reinforcement learning where the training process collapses when using a streaming setup—meaning no replay buffer and a batch size of one. This prevents the agent from learning effectively from individual, real-time experiences, which is a fundamental characteristic reinforcement learning should possess.

QWhat is the core principle behind the 'Intentional Updates' method proposed in the paper?

AThe core principle of 'Intentional Updates' is to specify how much the function's output (e.g., a value prediction) should change after a parameter update, rather than specifying how much the parameters themselves should move. It inverts the traditional approach by using the desired output change to determine the appropriate step size for the update, leading to more stable learning in a streaming environment.

QHow does the Intentional Updates method relate to historical work from 1967?

AThe idea is conceptually linked to the 1967 Normalized Least Mean Squares (NLMS) algorithm by Nagumo and Noda, which used the expected output change to determine the step size for adaptive filtering. The new paper generalizes this core idea from simple linear settings to the complex, non-linear function approximation context of deep reinforcement learning.

QWhat are some key performance results of the Intentional AC algorithm mentioned in the article?

AIn MuJoCo continuous control tasks with a strict streaming setup (batch size=1, no replay buffer), the Intentional AC algorithm achieved final performance close to or on par with SAC, a state-of-the-art method that uses large batch replay buffers. Furthermore, each Intentional AC update required about 1/140th the floating-point operations (FLOPS) of a single SAC update.

QWhat is a limitation or open problem acknowledged for the Intentional Updates method, particularly in policy learning?

AIn policy learning, the step size depends on the currently sampled action. This can implicitly assign different weights to different actions, potentially biasing the expected direction of the policy gradient. The paper notes that while this bias is negligible in some tasks, it can be more significant in others (e.g., Ant-v4), indicating a need for future research into action-independent step size selection strategies.

Related Reads

Bloomberg Uncovered: How Do China's Wealthy Circumvent the Annual $50,000 Limit to Transfer Assets?

**Summary: How Wealthy Chinese Circumvent $50,000 Annual Foreign Exchange Limits** Despite China's strict capital controls, including an annual $50,000 per person foreign exchange quota, an estimated $150 billion in funds still leaves the country annually via various gray and underground channels. This report outlines the evolution of China's "capital wall" and the methods used to bypass it. **The Evolving Capital Controls:** * **Foundation (1994):** The system of "current account convertibility with strict capital account controls" was established. * **Quota Set (2007):** The $50,000 individual annual forex purchase limit was formalized. * **Crackdown Begins (2015-2017):** Following market volatility, enforcement tightened. Banks were required to scrutinize transactions, and channels like using UnionPay cards for Hong Kong insurance premiums or buying overseas property were blocked. * **Digital & Legal Upgrades (2024-2026):** Enhanced algorithms now flag suspicious patterns (e.g., "smurfing"). The Common Reporting Standard (CRS) provides Chinese tax authorities with data on citizens' offshore accounts. Unlicensed cross-border brokers have been targeted. **Five Primary Methods for Moving Capital:** 1. **Underground Banking / "Hawala" (Duiqiao):** The largest-scale method. No money crosses borders. Clients pay RMB to a domestic account; an overseas associate deposits equivalent foreign currency into the client's offshore account. Risks include high fees, account freezes, and legal penalties. 2. **"Smurfing" or "Ant Moving":** Using multiple individuals' $50,000 quotas to pool funds for one offshore recipient. Increasingly detected by anti-money laundering algorithms. 3. **Trade Invoice Manipulation:** Businesses over-invoice imports or under-invoice exports via offshore shell companies, creating a pretext to transfer excess funds abroad under the guise of trade. 4. **Channel Migration:** After a crackdown on internet brokers, funds flow toward more compliant but costly channels like major banks' cross-border wealth management services or Qualified Domestic Institutional Investor (QDII) quotas. 5. **Structural Arrangements:** High-net-worth individuals use complex, high-cost legal structures involving offshore trusts, insurance, and investment migration programs to transfer asset ownership. **Regulatory Response: Focusing on People, Not Just Money** The current strategy extends oversight from enterprises to **individual residents**. Tools like CRS allow retroactive visibility into offshore assets. Cryptocurrencies, once seen as a potential loophole, are now actively monitored and prosecuted as an illegal channel. The underlying driver remains: with significant wealth concentrated among millions of affluent households seeking diversification amid domestic economic shifts, the incentive to move assets offshore persists despite regulatory barriers.

marsbit10m ago

Bloomberg Uncovered: How Do China's Wealthy Circumvent the Annual $50,000 Limit to Transfer Assets?

marsbit10m ago

Ethereum's Ballmer Moment: As Everyone Is Bearish, the Circulating Supply Is Disappearing

"Ethereum's Ballmer Moment: Circulation Shrinks Amid Bearish Sentiment" Amid widespread bearish sentiment, with prominent figures like Bankless founder David Hoffman selling ETH and young developers flocking to Solana, some argue Ethereum is entering its "Ballmer era"—akin to Microsoft's perceived stagnation under Steve Ballmer. While surface-level criticisms about slow protocol development, cautious leadership, and competitive pressure are valid, underlying fundamentals tell a different story. Approximately 30% of ETH is staked, major holders like BitMine are accumulating, and spot ETFs continue to absorb supply. Regulatory clarity, including the SEC/CFTC's March ruling on staking rewards and the potential passage of the CLARITY Act, is transforming crypto from a regulatory threat into a legitimized framework. This institutionalization, alongside a shrinking circulating supply (with net issuance around 0.23% annually), creates significant buy-side pressure independent of fee-based value capture. The broader crypto total addressable market is expanding through regulated stablecoins, tokenized assets, and institutional adoption. While public chains face competition from permissioned alternatives, the winning model appears to be permissioned assets settling on public chains like Ethereum and Solana. The author advocates a non-maximalist, barbell strategy: holding ETH for its institutional role and supply squeeze, SOL for consumer/throughput trends, BTC as a macro hedge, and a basket of next-gen L1s. Key bullish drivers for ETH include rapid circulation shrinkage, potential Q2 staked ETF approvals, regulatory tailwinds solidifying its role as a default settlement layer, and the optionality of an eventual "Satya moment" leadership shift. Despite bearish consensus, the current setup—where crypto is "not hot" and regulatory groundwork is being laid—presents a compelling investment opportunity. The crypto cycle's focus may have shifted to AI, but blockchain infrastructure is gaining a legal and institutional foothold precisely while attention is elsewhere.

marsbit11m ago

Ethereum's Ballmer Moment: As Everyone Is Bearish, the Circulating Supply Is Disappearing

marsbit11m ago

Claude Code Introduces Dynamic Workflows: Enabling AI to Form Teams and Collaborate

Claude Code introduces dynamic workflows, enabling AI to coordinate teams of specialized agents for complex tasks. This transforms Claude from a code assistant into a programmable workbench. Workflows address key limitations of single-agent systems: agentic laziness (premature task completion), self-preferential bias (favoring own outputs), and goal drift (losing sight of original objectives). The system allows Claude to dynamically create execution frameworks using JavaScript. It can split tasks, dispatch parallel agents for isolated work (e.g., in separate worktrees), implement adversarial validation, run tournaments, and synthesize results. This multi-agent approach is valuable for tasks requiring deep research, factual verification, code migration, root cause analysis, large-scale triage, and qualitative sorting. Key patterns include: classify-and-route, fan-out-and-synthesize, adversarial verification, generate-and-filter, tournaments, and loop-until-done. While token usage is higher, workflows excel where tasks resemble programming—needing problem decomposition, isolated context, hypothesis testing, and handling many details. They extend Claude Code's utility beyond technical work to areas like business plan review, resume screening, and naming brainstorm. The feature is not a universal solution but points to a future where AI tool competitiveness depends on organizing reliable, reusable, and auditable execution flows for complex goals.

marsbit52m ago

Claude Code Introduces Dynamic Workflows: Enabling AI to Form Teams and Collaborate

marsbit52m ago

Hyperliquid, Wall Street's 24/7 Trading Convenience Store

Hyperliquid: The 24/7 Trading "Convenience Store" for Wall Street Hyperliquid, a decentralized cryptocurrency exchange, has become a go-to platform for Wall Street traders seeking to trade around the clock, especially during traditional market closures. Founded by Jeff Yan, a former quantitative trader, after the FTX collapse, the platform emphasizes user self-custody of assets. It offers a wide range of perpetual contracts—leveraged derivatives with no expiry—on assets from Bitcoin and crude oil to the S&P 500 and even pre-IPO companies like SpaceX. A notable example involves a hedge fund trader who capitalized on geopolitical news over a weekend, securing a 243% return on oil derivatives before markets reopened. The platform, run by just 11 employees, generated approximately $800 million in revenue last year, and its native token HYPE has seen significant growth. Its rise highlights the merging of traditional finance and crypto. While U.S. users are currently restricted, recent CFTC rule changes could open access. The platform is known for its transparency, having processed $10 billion in liquidations during a market crash while competitors faltered. Regulators warn of the high risks and complexity of perpetual contracts for retail investors. Key to its appeal is a strong community culture, direct engagement with founders, and a simple interface. Despite rules against VPN use, it attracts global users with its permissionless approach. Hyperliquid plans to expand into prediction markets and options, aiming to eventually host all financial activity.

marsbit52m ago

Hyperliquid, Wall Street's 24/7 Trading Convenience Store

marsbit52m ago

Trading

Spot
Futures
活动图片