Turing Award Laureate Sutton's New Work: Using a Formula from 1967 to Solve a Major Flaw in Streaming Reinforcement Learning

marsbit · Published 2026-05-10 · Last updated 2026-05-10

Summary

New research titled "Intentional Updates for Streaming Reinforcement Learning" (arXiv:2604.19033v1), involving Turing Award laureate Richard Sutton, addresses a core challenge in deep reinforcement learning (RL): the "stream barrier." Current deep RL methods typically rely on replay buffers and batch training for stability, failing catastrophically when learning online from single data points (streaming). The authors propose a fundamental shift: instead of prescribing how far to move parameters (a fixed step size), their "Intentional Updates" method specifies the desired change in the function's output (e.g., a 5% reduction in value prediction error). It then calculates the step size needed to achieve that intent. This idea is inspired by the Normalized Least Mean Squares (NLMS) algorithm from 1967. Applied to value and policy learning, this yields algorithms like Intentional TD(λ) and Intentional AC. The method inherently stabilizes learning by adapting the step size based on the local gradient landscape, preventing overshooting/undershooting. In experiments on MuJoCo continuous control and Atari discrete tasks, Intentional AC achieved performance rivaling batch-based algorithms like SAC in a streaming setting (batch size=1, no replay buffer), while being ~140x more computationally efficient per update. The work demonstrates significant robustness, reducing reliance on numerous stabilization tricks. A remaining challenge is bias in policy updates due to action-dependent s...

At the end of 2024, a paper titled "Streaming Deep Reinforcement Learning Finally Works" (arXiv:2410.14606) sparked widespread discussion in the academic community. The authors, from Mahmood's team at the University of Alberta, spent considerable effort describing an embarrassing reality: reinforcement learning, a method that is inherently "learn-as-you-go," has almost become incapable of doing so in the era of deep neural networks. If you simply remove the replay buffer and set the batch size to 1, training collapses. They called this the "stream barrier".

That paper proposed the StreamX series of algorithms, which barely scaled this wall through meticulous tuning of hyperparameters, sparse initialization, and various stabilization techniques.

However, less than a year and a half later, a member of the same research group, along with collaborators from the Openmind Institute, provided a distinctly different answer: the root cause of the stream barrier is not "insufficient data," but "the step size having the wrong unit."

Paper title: Intentional Updates for Streaming Reinforcement Learning

Paper link: https://arxiv.org/pdf/2604.19033v1

Code repository: https://github.com/sharifnassab/Intentional_RL

Stepping on the Gas, How Big a Hole Does It Dig?

Imagine you're learning to parallel park a car. The instructor tells you to "press the gas pedal for 0.1 seconds" each time. The problem is, pressing for the same 0.1 seconds can result in vastly different distances traveled depending on whether you're going uphill, downhill, empty, or fully loaded. Sometimes you're off by a centimeter and park perfectly, other times you're off by 30 centimeters and hit the wall.

Traditional gradient learning step sizes do precisely this: they dictate how much the parameters should move, but exert no control over how much the function's output actually changes. In batch training, the errors of hundreds or thousands of samples are averaged, diluting extreme cases, so the problem isn't obvious. But in a "streaming" environment, where each step involves only one sample, there is no averaging. Once the gradient direction becomes unstable, the magnitude of updates can swing wildly—moving forward 30 cm today, backward 50 cm tomorrow—causing the learning process to collapse amid violent oscillations.
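The mismatch between parameter movement and output movement can be seen in a toy case. The sketch below is only an illustration, assuming a linear predictor and a squared-error target; the function name `error_fraction_removed` and all numbers are hypothetical and do not come from the paper. It applies the same step size to two inputs of different scale and reports how much of the prediction error each update removes.

```python
import numpy as np

def error_fraction_removed(x, alpha, target=1.0):
    """Fraction of the prediction error removed by one fixed-step SGD update."""
    w = np.zeros_like(x)                   # linear predictor y = w @ x, starting at zero
    err_before = target - w @ x
    w = w + alpha * err_before * x         # same step size regardless of the input
    err_after = target - w @ x
    return 1.0 - err_after / err_before

alpha = 0.1
small = error_fraction_removed(np.array([0.5, 0.5]), alpha)  # ||x||^2 = 0.5
large = error_fraction_removed(np.array([4.0, 3.0]), alpha)  # ||x||^2 = 25
print(small, large)
```

With the same step size, the first update removes roughly 5% of the error, while the second removes about 250% of it: the prediction jumps past the target and ends up further away than it started.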

This phenomenon of "overshooting and undershooting" is particularly severe in reinforcement learning because the gradient at each timestep not only varies in magnitude but also changes direction rapidly.

Redefining "How Much a Step Should Do"

In a recent paper, Arsalan Sharifnassab from the Openmind Institute, along with Mohamed Elsayed, A. Rupam Mahmood, and Richard Sutton from the University of Alberta, proposed a solution from a different angle: Instead of specifying how much the parameters should move, directly specify how much the function's output should change.

This idea is not entirely new. In 1967, Japanese scholars Nagumo and Noda, in their paper "A learning method for system identification," proposed the "Normalized Least Mean Squares" (NLMS) algorithm in the field of adaptive filtering. Its essence is likewise to derive the step size from the desired output change, not the other way around. However, that algorithm was only suitable for simple linear scenarios.
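For readers unfamiliar with NLMS, a minimal sketch of the textbook form of the update follows. It assumes a noise-free linear system to identify; the names `nlms_step` and `true_w`, and the choice of `mu` and input scales, are illustrative assumptions rather than anything taken from the 1967 paper or the new one.

```python
import numpy as np

def nlms_step(w, x, target, mu=0.05, eps=1e-8):
    """One NLMS update: the step size is normalized by the input's squared norm."""
    err = target - w @ x
    w_new = w + (mu / (eps + x @ x)) * err * x   # removes about a fraction mu of err
    return w_new, err

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])              # hypothetical system to identify
w = np.zeros(3)
for _ in range(2000):
    # Inputs arrive at wildly different scales, one sample at a time.
    x = rng.normal(size=3) * rng.choice([0.1, 1.0, 10.0])
    w, _ = nlms_step(w, x, true_w @ x, mu=0.5)

# Single-step check: the error on the current sample shrinks by the factor (1 - mu).
x = np.array([30.0, -40.0, 0.0])
w_probe, err = nlms_step(np.zeros(3), x, target=2.0, mu=0.05)
err_after = 2.0 - w_probe @ x
```

Because the step is divided by the squared norm of the input, the fraction of error removed per sample stays at `mu` whether the input is tiny or huge, which is the property the article says the new work carries over to deep networks.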

The researchers extended this idea to deep reinforcement learning. They call it "Intentional Updates": before each update, first clarify "what I hope to achieve with this step," then deduce the step size that should be used.

For value learning (i.e., predicting future rewards), their defined intention is: after each update, the prediction error for the current state's value should shrink by a fixed proportion—for example, by 5%, no more, no less. For policy learning (i.e., optimizing decision-making actions), their defined intention is: the probability of selecting the current action is only allowed to change by a "moderate" amount each step.
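The value-learning intention can be sketched for the simplest case. The code below assumes a linear value function, semi-gradient TD(0) with the bootstrapped target held fixed during the update, and nonzero features; `intentional_td0_update` and the sample numbers are hypothetical and are not the paper's implementation.

```python
import numpy as np

def intentional_td0_update(w, x, reward, x_next, gamma=0.99, eta=0.05):
    """Semi-gradient TD(0) step whose size is derived from the intended output change."""
    delta = reward + gamma * (w @ x_next) - w @ x   # TD error before the update
    grad = x                                        # gradient of w @ x with respect to w
    alpha = eta / (grad @ grad)                     # assumes x is not the zero vector
    return w + alpha * delta * grad, delta

w = np.array([0.2, -0.1, 0.4])
x = np.array([3.0, 0.0, 4.0])                       # features of the current state
x_next = np.array([0.0, 1.0, 0.0])                  # features of the next state
w_new, delta = intentional_td0_update(w, x, reward=1.0, x_next=x_next)

moved = w_new @ x - w @ x                           # realized change in v(s)
print(moved, 0.05 * delta)
```

In this linear setting the value of the current state moves by exactly `eta * delta`, so the error against the pre-update target shrinks by 5% regardless of how large the feature vector is.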

Using the driving metaphor: this is like the driver deciding before each operation, "I want the car to move forward 20 cm," then automatically calculating how deep to press the gas pedal based on current road conditions (gradient, load), instead of pressing the same depth each time and leaving it to fate.

The Turing Award Laureate and His Puzzle

One of the paper's authors is Richard S. Sutton, the 2024 Turing Award laureate, widely regarded as the "father of modern reinforcement learning."

Sutton's stature in academia is roughly equivalent to that of Feynman in physics: he not only proposed temporal-difference (TD) learning and policy gradient methods, the foundations of modern reinforcement learning, but also co-authored, with Andrew Barto, the field's most authoritative textbook, "Reinforcement Learning: An Introduction" (now in its second edition, available online for free). He shared the 2024 Turing Award with Barto, with the award citation reading, "for developing the conceptual and algorithmic foundations of reinforcement learning."

After receiving the award, Sutton did not retire but instead invested the prize money into the Openmind Institute he founded, specifically funding young researchers willing to "explore fundamental problems in an environment free from commercial pressure." This new paper emerged from this non-profit institution.

And the paper's first author, Sharifnassab, had recently published the MetaOptimize framework at ICML 2025, researching how to automatically tune learning rates online. The focus of both topics is highly consistent: how to make the step size itself more intelligent.

Algorithm Details: Simpler Than Imagined

The mathematical derivation of "Intentional Updates" is not complex; its core formula can be described in one sentence: the step size equals the "desired output change" divided by the "actual influence of the gradient direction on the output."

In value learning, this "actual influence" is, to first order, the squared norm of the gradient vector (essentially measuring how "steep" the current parameter region is): step sizes are smaller in steeper areas and larger in flatter areas, ensuring the impact of each update on the value function remains consistent.
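The one-sentence formula can be written out as a first-order derivation. This is a sketch under simplifying assumptions (no eligibility traces, no diagonal scaling, target held fixed), with the symbols θ for parameters, δ for the TD error, g for the gradient of the value, and η for the intended fraction all chosen here for illustration; the paper's exact expressions may differ. Note that the quantity that appears in the denominator is the squared gradient norm.

```latex
\begin{aligned}
\theta' &= \theta + \alpha\,\delta\,g, \qquad g = \nabla_\theta v_\theta(s) \\
v_{\theta'}(s) - v_\theta(s) &\approx g^\top(\theta' - \theta) = \alpha\,\delta\,\lVert g\rVert^2 \\
\text{intent: } v_{\theta'}(s) - v_\theta(s) = \eta\,\delta
\;&\Longrightarrow\; \alpha = \frac{\eta}{\lVert g\rVert^2}
\end{aligned}
```

For a linear value function the approximation in the second line is exact, which recovers the NLMS step size.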

In policy learning, the "desired change" is defined to be proportional to the advantage function: how much better the current action is compared to the average determines how much the policy moves in that direction—normalized in magnitude through a running average, ensuring that over the long term, the magnitude of policy changes remains stable within an interpretable range.
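One way to picture the running-average normalization is sketched below. It assumes the normalizer is a bias-corrected exponential moving average of the absolute advantage; the class `AdvantageScale`, the function `intended_policy_change`, and the constants are hypothetical stand-ins, and the paper's actual normalizer may be defined differently.

```python
class AdvantageScale:
    """Bias-corrected running average of |advantage| (an assumed normalizer)."""

    def __init__(self, beta=0.99):
        self.beta, self.avg, self.t = beta, 0.0, 0

    def update(self, advantage):
        self.t += 1
        self.avg = self.beta * self.avg + (1 - self.beta) * abs(advantage)
        return self.avg / (1 - self.beta ** self.t)   # undo the zero-initialization bias

def intended_policy_change(advantage, scale, eta=0.05, eps=1e-8):
    """Intended change for the sampled action, proportional to the advantage."""
    return eta * advantage / (scale + eps)

# Two hypothetical tasks whose advantages differ in scale by a factor of 1000.
changes = []
for magnitude in (10.0, 0.01):
    tracker = AdvantageScale()
    for step in range(500):
        advantage = magnitude if step % 2 == 0 else -magnitude
        scale = tracker.update(advantage)
    changes.append(abs(intended_policy_change(advantage, scale)))
print(changes)
```

Both tasks end up with an intended change of about `eta` in magnitude, which illustrates why a normalizer of this kind makes the size of policy changes interpretable across reward scales.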

The researchers also combined this core idea with two engineering practices: RMSProp-style diagonal scaling (handling differences in magnitude across parameter dimensions) and eligibility traces (helping reward signals propagate to past timesteps).
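How these pieces might fit together can be sketched for a linear value function. Everything in this block is an assumption made for illustration: the class name, the accumulating trace, the choice to track the second moment of `delta * z`, and especially the normalization by the trace's own scaled norm, which was picked here because it is always positive. The paper's algorithm, including its σ term, is not reproduced.

```python
import numpy as np

class IntentionalTDLambdaSketch:
    """Linear TD(lambda) with RMSProp-style scaling and an intent-derived step size."""

    def __init__(self, n, gamma=0.99, lam=0.8, beta=0.999, eta=0.05, eps=1e-8):
        self.w, self.z, self.nu = np.zeros(n), np.zeros(n), np.zeros(n)
        self.gamma, self.lam, self.beta, self.eta, self.eps = gamma, lam, beta, eta, eps

    def update(self, x, reward, x_next):
        delta = reward + self.gamma * (self.w @ x_next) - self.w @ x
        self.z = self.gamma * self.lam * self.z + x              # accumulating trace
        self.nu = self.beta * self.nu + (1 - self.beta) * (delta * self.z) ** 2
        direction = self.z / (np.sqrt(self.nu) + self.eps)       # per-coordinate scaling
        influence = self.z @ direction                           # positive unless z == 0
        if influence > 0.0:
            self.w = self.w + (self.eta / influence) * delta * direction
        return delta

# With lam=0 the trace equals x, so the intent holds exactly for the current state.
agent = IntentionalTDLambdaSketch(3, lam=0.0)
x = np.array([100.0, 0.01, 1.0])                                 # badly scaled features
v_before = agent.w @ x
delta = agent.update(x, reward=1.0, x_next=np.array([0.0, 1.0, 0.0]))
moved = agent.w @ x - v_before

# A short streaming run with traces switched on, one transition at a time.
rng = np.random.default_rng(0)
traced = IntentionalTDLambdaSketch(3, lam=0.8)
for _ in range(1000):
    traced.update(rng.normal(size=3), rng.normal(), rng.normal(size=3))
```

Because the step size is normalized against the same scaled direction it multiplies, the overall magnitude of the RMSProp scaling cancels out and only the relative weighting between coordinates remains.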

This ultimately forms three complete algorithms: Intentional TD(λ) for value prediction, Intentional Q(λ) for discrete action control, and Intentional Policy Gradient for continuous control.

Experimental Results: Matching SAC Even Without GPUs

The paper evaluated this approach on multiple standard benchmarks, with impressive results.

On MuJoCo continuous control tasks (including complex simulated robots like Ant, Humanoid, HalfCheetah), the new method, Intentional AC, in a streaming setup (batch size = 1, no replay buffer), achieved final performance that repeatedly came close to or even matched SAC—an algorithm that uses large-batch replay buffers and is almost the gold standard for current continuous control tasks. In terms of computational cost, each Intentional AC update required only about 1/140th of the floating-point operations of a single SAC update.

On Atari and MinAtar discrete-action games, Intentional Q-learning performed comparably to DQN, which uses a replay buffer, and successfully ran all tasks with the same set of hyperparameters, without requiring per-task tuning.

The researchers also specifically verified whether the "intention" was truly realized: they measured the ratio of actual update magnitude to intended update magnitude. In a simplified setting with eligibility traces disabled, the standard deviation of this ratio was only 0.016 to 0.029, with the 99th percentile within 1.07 in every case. In other words, in the vast majority of cases the updates indeed achieved "exactly what they were supposed to do."

An ablation study further showed that removing RMSProp normalization or the σ term lowered performance somewhat but left it competitive, which suggests that intentional scaling itself is the primary contributor and the other components are auxiliary.
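The kind of check described here can be imitated on a toy problem. The sketch below assumes a small tanh network trained toward a made-up regression target with a step size of `eta` divided by the squared gradient norm; the network, the target function, and all constants are hypothetical, and the output is not meant to reproduce the paper's statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, eta = 4, 16, 0.05
W1 = rng.normal(size=(n_hidden, n_in)) / np.sqrt(n_in)
w2 = rng.normal(size=n_hidden) / np.sqrt(n_hidden)

def value_and_grads(W1, w2, s):
    """Value v(s) = w2 . tanh(W1 s) and its gradients with respect to W1 and w2."""
    h = np.tanh(W1 @ s)
    return w2 @ h, np.outer(w2 * (1.0 - h**2), s), h

ratios = []
for _ in range(2000):
    s = rng.normal(size=n_in)
    target = np.sin(s.sum())                     # stand-in for a learning target
    v, g_W1, g_w2 = value_and_grads(W1, w2, s)
    delta = target - v
    alpha = eta / (np.sum(g_W1**2) + np.sum(g_w2**2))
    W1 = W1 + alpha * delta * g_W1               # one-sample (streaming) update
    w2 = w2 + alpha * delta * g_w2
    v_new, _, _ = value_and_grads(W1, w2, s)
    ratios.append((v_new - v) / (eta * delta))   # realized change / intended change

ratios = np.array(ratios)
print(ratios.mean(), ratios.std(), np.percentile(ratios, 99))
```

A ratio of 1 means the update did exactly what was intended; deviations come from the curvature that the first-order step-size rule ignores.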

Problems Remain

The "Intentional Update" framework also demonstrated significant advantages in robustness. When the researchers removed, one by one, the various stabilizing auxiliary techniques (sparse initialization, reward scaling, input normalization, LayerNorm) that the StreamX method relied on, Intentional AC's performance degradation was significantly less than that of the original StreamAC, indicating that intentional scaling reduces reliance on external "crutches" at the root.

However, the paper also candidly addresses a not-yet-fully-resolved issue: in policy learning, the step size depends on the currently sampled action, which implicitly assigns different "weights" to different actions and may alter the expected direction of the policy gradient. In Humanoid and HumanoidStandup tasks, by measuring the cosine similarity of expected update directions, the researchers found the alignment was close to 0.96 during critical learning phases, meaning the bias was almost negligible; but in Ant-v4, the alignment dropped to a median of 0.63, indicating the problem cannot always be ignored.
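The source of this bias can be illustrated with a toy discrete policy, even though the paper's measurements concern continuous control. The sketch assumes a three-action softmax policy and a hypothetical step-size rule that divides by the squared norm of each action's log-probability gradient; the helper names and numbers are illustrative only.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def expected_direction(pi, adv, step):
    """Expected update: sum over a of pi(a) * step(a) * A(a) * grad log pi(a)."""
    grads = np.eye(len(pi)) - pi          # row a is grad of log pi(a) w.r.t. the logits
    return (pi * step * adv) @ grads

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pi = softmax(np.array([2.0, 0.0, -1.0]))
q = np.array([1.0, 0.0, 2.0])             # hypothetical action values
adv = q - pi @ q                          # advantages with the state value as baseline

grads = np.eye(3) - pi
reference = expected_direction(pi, adv, np.ones(3))              # plain policy gradient
constant = expected_direction(pi, adv, np.full(3, 3.0))          # action-independent step
dependent = expected_direction(pi, adv, 1.0 / np.sum(grads**2, axis=1))

cos_constant = cosine(reference, constant)
cos_dependent = cosine(reference, dependent)
print(cos_constant, cos_dependent)
```

Any step size that is the same for all actions leaves the expected direction unchanged, while the action-dependent rule reweights the actions and tilts the expected update away from the true policy gradient.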

The authors point out that future research should seek step-size selection strategies independent of the action, keeping the "intention" unbiased in expectation as well. This is a clear assignment left for future researchers in this direction.

Conclusion: Enabling AI to Learn Like Humans, On the Job

The current mainstream paradigm for training large models relies on batch digestion of massive data: feeding in all the text and code from the internet, repeatedly iterating until astonishing capabilities emerge. This path has proven effective, but it is fundamentally "learn first, use later": once training is complete, the model is frozen, unable to continuously update from subsequent real-world interactions.

What streaming reinforcement learning pursues is another, completely different learning mode: not relying on massive replay, not relying on huge GPU clusters, converting every single experience immediately into a parameter update, continuously, cheaply, and adaptively. This is closer to how humans and animals actually learn.

From the initial breakthrough of "finally working" by Elsayed et al. in 2024, to the "Intentional Update" principle proposed in this paper, streaming deep reinforcement learning is maturing at a surprisingly rapid pace. It will not replace batch-trained large models, but for applications requiring long-term online adaptation—like robots, edge devices, and any scenario that cannot afford large replay buffers and GPU clusters—this path is becoming increasingly compelling.

The step size is not just a hyperparameter; it is the AI's commitment to "how much it intends to do" with each step. When this commitment finally becomes controllable, learning itself stabilizes.

This article is from the WeChat public account "Almost Human" (ID: almosthuman2014), author: someone interested in RL.

Related questions

Q: What is the 'stream barrier' problem described in the article?

A: The 'stream barrier' refers to a major difficulty in deep reinforcement learning where the training process collapses when using a streaming setup, meaning no replay buffer and a batch size of one. This prevents the agent from learning effectively from individual, real-time experiences, which is a fundamental characteristic reinforcement learning should possess.

Q: What is the core principle behind the 'Intentional Updates' method proposed in the paper?

A: The core principle of 'Intentional Updates' is to specify how much the function's output (e.g., a value prediction) should change after a parameter update, rather than specifying how much the parameters themselves should move. It inverts the traditional approach by using the desired output change to determine the appropriate step size for the update, leading to more stable learning in a streaming environment.

Q: How does the Intentional Updates method relate to historical work from 1967?

A: The idea is conceptually linked to the 1967 Normalized Least Mean Squares (NLMS) algorithm by Nagumo and Noda, which used the desired output change to determine the step size for adaptive filtering. The new paper generalizes this core idea from simple linear settings to the complex, non-linear function approximation context of deep reinforcement learning.

Q: What are some key performance results of the Intentional AC algorithm mentioned in the article?

A: In MuJoCo continuous control tasks with a strict streaming setup (batch size=1, no replay buffer), the Intentional AC algorithm achieved final performance close to or on par with SAC, a state-of-the-art method that uses large batch replay buffers. Furthermore, each Intentional AC update required about 1/140th the floating-point operations (FLOPs) of a single SAC update.

Q: What is a limitation or open problem acknowledged for the Intentional Updates method, particularly in policy learning?

A: In policy learning, the step size depends on the currently sampled action. This can implicitly assign different weights to different actions, potentially biasing the expected direction of the policy gradient. The paper notes that while this bias is negligible in some tasks, it can be more significant in others (e.g., Ant-v4), indicating a need for future research into action-independent step size selection strategies.
