World Models Shift from Prediction to Planning: HWM and the Challenge of Long-Horizon Control

marsbit · Published 2026-04-17 · Last updated 2026-04-17

Abstract

World models have evolved from focusing on representation learning and future prediction to addressing long-horizon planning challenges. While models like V-JEPA 2 demonstrate strong predictive capabilities using large-scale video pre-training, they struggle with multi-stage control tasks because prediction errors accumulate and the action search space grows rapidly with the planning horizon. HWM (Hierarchical World Model) introduces a two-level planning structure: a high-level planner outlines coarse subgoals over longer time horizons, while a low-level executor handles short-term actions. This decomposition reduces planning complexity and error propagation. In experiments, HWM achieved 70% success on a real-world robotic task where the flat model failed entirely. The three lines of work are complementary: V-JEPA 2 focuses on representation, HWM on hierarchical planning, and WAV (World Action Verifier) on self-correction. Together, they mark a shift from pure world modeling to integrated systems capable of prediction, planning, and verification, which is key to deploying world models in real-world agents and long-term tasks.

Over the past year, world model research centered first on representation learning and future prediction: a model learns to understand the world, then simulates future states internally. This approach has already produced a number of representative results. V-JEPA 2 (Video Joint Embedding Predictive Architecture 2, a video world model suite released by Meta in 2025) was pre-trained on over 1 million hours of internet video and then combined with a small amount of robot interaction data, demonstrating the potential of world models in understanding, prediction, and zero-shot robot planning.

However, a model's ability to predict does not equate to its ability to handle long-horizon tasks. When faced with multi-stage control, systems typically encounter two challenges. One is that prediction errors accumulate over long rollouts (multi-step simulations), causing the entire path to increasingly deviate from the goal. The other is that the action search space expands rapidly as the planning horizon increases, leading to continuously rising planning costs. HWM (Hierarchical World Model) does not rewrite the underlying learning approach of world models but instead adds a hierarchical planning structure on top of existing action-conditioned world models, enabling the system to first organize stage paths and then handle local actions.
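The first challenge, error accumulation, can be illustrated with a toy example. The sketch below is not taken from V-JEPA 2 or HWM: the 1-D dynamics, the 5% action bias, and the names `true_step`, `model_step`, and `rollout_gap` are all assumptions chosen only to show how a small per-step error grows with rollout length.

```python
def true_step(state: float, action: float) -> float:
    """Ground-truth toy dynamics (assumed): the state moves by the action."""
    return state + action


def model_step(state: float, action: float, bias: float = 0.05) -> float:
    """Imperfect toy world model (assumed): overestimates each action by `bias`."""
    return state + (1.0 + bias) * action


def rollout_gap(actions, bias: float = 0.05):
    """Roll both systems forward open-loop and record |imagined - real| per step."""
    real = imagined = 0.0
    gaps = []
    for action in actions:
        real = true_step(real, action)
        imagined = model_step(imagined, action, bias)
        gaps.append(abs(imagined - real))
    return gaps


gaps = rollout_gap([1.0] * 20)
print(f"gap after 5 steps:  {gaps[4]:.2f}")
print(f"gap after 20 steps: {gaps[19]:.2f}")
```

In this toy setting the gap grows linearly with the number of steps. With learned models in high-dimensional spaces the growth can be faster, which is one reason shorter rollouts are attractive.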

From a technical perspective, V-JEPA 2 (https://ai.meta.com/research/vjepa/) leans more toward world representation and basic prediction, HWM leans more toward long-horizon planning, and WAV (World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry, https://arxiv.org/abs/2604.01985) leans more toward the model's ability to identify and correct its own prediction distortions. These three lines of research are gradually converging. The focus of world model research has shifted from merely predicting the future to transforming predictive capabilities into executable, correctable, and verifiable system capabilities.

I. Why Long-Horizon Control Remains a Bottleneck for World Models

The difficulties of long-horizon control become clearer when applied to robotic tasks. Take robotic arm manipulation as an example: picking up a cup and placing it in a drawer is not a single action but a sequence of continuous steps. The system must approach the object, adjust its posture, complete the grasp, move to the target location, and then handle the drawer and placement. As the chain lengthens, two problems arise simultaneously. One is that prediction errors accumulate along the rollout, and the other is that the action search space expands rapidly.

What the system often lacks is not local predictive ability but the capacity to organize distant goals into stage paths. Many actions may appear to deviate from the goal locally but are actually intermediate steps required to achieve it, such as raising the arm before grasping, or moving back slightly and adjusting the angle before opening a drawer.

In demonstration tasks, world models can already produce coherent predictions. In real control scenarios, however, performance declines. The pressure comes not only from the representation itself but also from the immaturity of the planning layer.

II. How HWM Restructures the Planning Process

HWM splits the originally single-layer planning process into two layers. The upper layer is responsible for stage direction at a longer time scale, while the lower layer handles local execution at a shorter time scale. The model plans at two different temporal rhythms simultaneously, rather than at a single pace.

When handling long tasks, single-layer methods typically need to search the entire action chain directly in the underlying action space. The longer the task, the higher the search cost, and the more likely prediction errors are to compound along multi-step rollouts. After HWM's decomposition, the upper layer only handles route selection at a longer time scale, and the lower layer only handles the execution of the current segment. The entire long task is broken into multiple shorter segments, reducing planning complexity.
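A rough count shows why splitting a long task shrinks the search. The article does not say which planner HWM uses, so the sketch below assumes exhaustive enumeration over a small discrete action set purely for illustration. The function names and the parameters `num_subgoals` and `num_segments` are hypothetical.

```python
def flat_cost(num_actions: int, horizon: int) -> int:
    """Candidate action sequences a single-layer planner would enumerate."""
    return num_actions ** horizon


def hierarchical_cost(num_actions: int, horizon: int,
                      num_segments: int, num_subgoals: int) -> int:
    """Candidates enumerated by a two-layer planner (assumed setup).

    The upper layer picks one of `num_subgoals` options for each segment;
    the lower layer searches each segment's short action chain separately.
    """
    if horizon % num_segments != 0:
        raise ValueError("horizon must be divisible by num_segments")
    segment_len = horizon // num_segments
    high_level = num_subgoals ** num_segments
    low_level = num_segments * num_actions ** segment_len
    return high_level + low_level


flat = flat_cost(num_actions=4, horizon=12)
hier = hierarchical_cost(num_actions=4, horizon=12, num_segments=3, num_subgoals=4)
print(f"flat: {flat:,} candidates, hierarchical: {hier:,} candidates")
```

The exact numbers depend entirely on the assumed action set and segment length. The point is only that the exponent moves from the full horizon to the segment length.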

Another key design is that high-level actions are not simply the difference between two states but use an encoder to compress a sequence of low-level actions into a higher-level action representation. For long tasks, the key is not just the difference between the start and end points but also how the intermediate steps are organized. If the upper layer only looks at displacement differences, it may lose path information in the action chain.
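The difference between a state delta and an encoded action sequence can be seen in a small example. HWM's encoder is learned, and its form is not given in the article. The hand-written `encode_actions` below (a decaying weighted sum) is only a stand-in, and every name in the sketch is hypothetical.

```python
from typing import Sequence, Tuple

Action = Tuple[float, float]  # a 2-D displacement, assumed for illustration


def state_delta(actions: Sequence[Action]) -> Action:
    """Net displacement only: start-to-end difference, order is discarded."""
    return (sum(a[0] for a in actions), sum(a[1] for a in actions))


def encode_actions(actions: Sequence[Action], decay: float = 0.5) -> Action:
    """Hand-written stand-in for a learned sequence encoder.

    Earlier actions get larger weights, so the code depends on the order
    in which the low-level actions occur.
    """
    x = sum((decay ** t) * a[0] for t, a in enumerate(actions))
    y = sum((decay ** t) * a[1] for t, a in enumerate(actions))
    return (x, y)


right_then_up = [(1.0, 0.0), (0.0, 1.0)]
up_then_right = [(0.0, 1.0), (1.0, 0.0)]

print(state_delta(right_then_up), state_delta(up_then_right))
print(encode_actions(right_then_up), encode_actions(up_then_right))
```

Both paths end at the same point, so their state deltas are identical, but their encodings differ. If one of the two routes passes through an obstacle, only the sequence-aware representation lets the upper layer tell them apart.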

HWM embodies a hierarchical task organization approach. Faced with a multi-stage task, the system no longer unfolds all actions at once but first forms a coarse stage path and then executes and corrects it segment by segment. Once this hierarchical relationship is incorporated into the world model, predictive capabilities begin to transform more stably into planning capabilities.
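The plan-coarsely, then execute-and-correct loop described above can be sketched on a toy 1-D task. This is a minimal illustration under stated assumptions, not HWM's implementation: the world model is a perfect integer-line simulator, the upper layer simply spaces subgoals at most `stride` apart, the lower layer enumerates short action sequences, and all function names are hypothetical.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # assumed discrete low-level action set


def model_step(state: int, action: int) -> int:
    """Toy world model (assumed perfect here): position on an integer line."""
    return state + action


def next_subgoal(state: int, goal: int, stride: int) -> int:
    """Upper layer: pick a coarse waypoint at most `stride` away, toward the goal."""
    return state + max(-stride, min(stride, goal - state))


def plan_segment(state: int, subgoal: int, length: int):
    """Lower layer: search short action sequences that end closest to the subgoal."""
    def final_distance(seq):
        s = state
        for action in seq:
            s = model_step(s, action)
        return abs(s - subgoal)
    return min(product(ACTIONS, repeat=length), key=final_distance)


def run(start: int, goal: int, stride: int = 4, max_segments: int = 20):
    """Alternate between choosing a subgoal and executing one short segment."""
    state, reached = start, []
    for _ in range(max_segments):
        if state == goal:
            break
        subgoal = next_subgoal(state, goal, stride)  # re-planned from the actual state
        for action in plan_segment(state, subgoal, stride):
            state = model_step(state, action)  # stands in for acting in the environment
        reached.append(state)
    return state, reached


final_state, reached = run(start=0, goal=10)
print(final_state, reached)
```

Because the subgoal is recomputed from the state actually reached after each segment, a deviation in one segment is corrected in the next instead of being carried through the whole task.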

III. From 0% to 70%: What the Experimental Results Indicate

In the real-world grasp-and-place task set up in the paper, the system was given only the final goal condition without manually decomposed intermediate goals. Under these conditions, HWM achieved a success rate of 70%, while the single-layer world model had a 0% success rate. A long task that the single-layer planner never completed became achievable in most trials once hierarchical planning was introduced.

The paper also tested simulation tasks such as object pushing and maze navigation. The results showed that hierarchical planning not only improved success rates but also reduced the computational cost of the planning phase. In some environments, the computational cost of the planning phase could be reduced to about a quarter of the original while maintaining higher or comparable success rates.

IV. From V-JEPA to HWM to WAV

V-JEPA 2 represents the world representation approach. V-JEPA 2 used over 1 million hours of internet video for pre-training, combined with less than 62 hours of robot video for post-training (targeted training after pre-training), resulting in a latent action-conditioned world model (a world model that predicts in an abstract representation space incorporating action information) usable for understanding, predicting, and planning in the physical world. It demonstrates that models can acquire world representations through large-scale observation and transfer these representations to robot planning.

HWM is the next step. The model already possesses world representation and basic predictive capabilities, but once multi-stage control is involved, the problems of error accumulation and search space expansion erupt. HWM does not change the underlying representation learning approach but adds a multi-timescale planning structure on top of existing action-conditioned world models. It addresses how the model organizes distant goals into a set of intermediate steps and then advances segment by segment.

WAV further focuses on verification capabilities. For world models to enter policy optimization and deployment scenarios, they cannot just predict; they must also identify areas where they are prone to distortion and make corrections accordingly. It focuses on how the model checks itself.

V-JEPA 2 leans toward world representation, HWM toward task planning, and WAV toward result verification. Although their focuses differ, their overall direction is consistent. The next phase of world models is no longer just internal prediction but the gradual integration of prediction, planning, and verification into a system capability.

V. From Internal Prediction to Executable Systems

Many past world model efforts were closer to improving the continuity of future state predictions or the stability of internal world representations. However, the current research focus is beginning to change. Systems must not only form judgments about the environment but also translate those judgments into actions and continue to adjust the next steps after results are obtained. To get closer to real deployment, it is necessary to control error propagation in long-horizon tasks, compress the search space, and reduce inference costs.

Such changes will also affect AI agents. Many agent systems can already handle short-chain tasks, such as calling tools, reading files, and executing multi-step commands. However, once tasks become long-chain, multi-stage, and require mid-course re-planning, performance declines. This is not fundamentally different from the difficulties in robotic control; both stem from insufficient high-level path organization capabilities, leading to a disconnect between local execution and overall goals.

The hierarchical approach provided by HWM—where the high level is responsible for paths and stage goals, the low level handles local actions and feedback processing, and result verification is layered on top—will continue to appear in more systems in the future. The next phase of world models will focus not only on predicting the future but on organizing prediction, execution, and correction into a viable path.

Related Questions

Q: What are the main challenges in long-horizon control for world models, as discussed in the article?

A: The main challenges are the accumulation of prediction errors during long rollout sequences, which causes the system to increasingly deviate from the goal, and the rapid expansion of the action search space as the planning horizon grows, leading to rising computational costs.

Q: How does HWM (Hierarchical World Model) address the problem of long-term task planning?

A: HWM restructures the planning process into two layers: a high-level layer that plans stage directions over longer time scales, and a low-level layer that handles local execution over shorter time scales. This hierarchical approach breaks long tasks into shorter segments, reducing planning complexity and error propagation.

Q: What key improvement did HWM demonstrate in experimental results for real-world tasks?

A: In a real-world grasping and placement task where only the final target was provided without intermediate goals, HWM achieved a 70% success rate, compared to a 0% success rate for a single-layer world model. In some simulated environments, it also reduced the computational cost of the planning phase to about a quarter of the original while maintaining higher or comparable success rates.

Q: What are the distinct focuses of V-JEPA 2, HWM, and WAV in world model research?

A: V-JEPA 2 focuses on world representation and foundational prediction using large-scale video pre-training. HWM emphasizes hierarchical task planning for long-horizon control. WAV (World Action Verifier) concentrates on self-verification, identifying and correcting prediction distortions to improve model reliability.

Q: How is the research focus of world models evolving beyond mere prediction?

A: The research is shifting from internal future prediction to building executable systems that integrate prediction, planning, and verification. This involves controlling error propagation, compressing search spaces, reducing inference costs, and organizing hierarchical structures for reliable long-term task execution.
