# Articles Related to World Models

The HTX News Center provides the latest articles and in-depth analysis on "World Models", covering market trends, project updates, technology developments, and regulatory policy in the digital currency industry.

Fei-Fei Li's Latest Article: When Video Generation, Robotics, and NVIDIA All Claim to Have 'World Models,' We Need a Taxonomy

"World Model" has become a widely used yet ambiguous term in AI. Drawing from the classic POMDP framework (agent → action → state → observation), this article proposes a functional taxonomy to clarify the concept. It identifies three distinct types, categorized by their output in the perception-action loop: 1. **Renderers**: Output visual observations (pixels). These models, like advanced video generators, prioritize visual fidelity but often lack underlying physical accuracy. 2. **Simulators**: Output the state of the world (geometry, physics, dynamics). They provide a structurally accurate representation for professionals (e.g., architects) and serve as training environments for robots and AI agents. 3. **Planners**: Output actions. Given an observation and a goal, they determine what an agent should do next, closing the perception-action loop (e.g., vision-language-action models). While renderers are currently the most commercially mature and planners are the most aspirational, the article argues that **simulators are the crucial, underappreciated hub**. By working at the level of geometry and physics, a simulator can project upwards to create visuals for humans and downwards to predict action consequences for agents. The future lies in the convergence of these three functions. Emerging research and products, like World Labs' Marble model which outputs both visual splats and physical collision meshes, are beginning to blur these boundaries. The logical endpoint is a unified world foundation model capable of rendering, simulating, and planning based on a shared understanding of spatial and temporal structures—ultimately enabling machines to understand, imagine, and interact with the physical world.

链捕手, 8 hours ago

Xing Bo Strikes Again: Last Time 'Critiquing' World Models, This Time It's Agents' Turn

Xing Bo (Eric Xing), President of MBZUAI and professor at Carnegie Mellon University, along with co-authors Mingkai Deng and Jinyu Hou, has released a new paper, "Critique of Agent Model," critiquing the current state of artificial intelligence agents. The paper draws a crucial distinction between "agentic" systems, which rely on external toolchains, prompts, and workflows, and truly "agentive" systems capable of genuine autonomy driven by internal decision-making structures. To illustrate this, it references a real-world incident where an AI programming assistant, following an external prompt but lacking internalized judgment, caused a catastrophic data deletion.

The authors propose a detailed analysis and a new framework, "Goal-Identity-Configurator" (GIC), for building truly autonomous agents. This framework systematically addresses five key dimensions where current "Agent" designs fall short:

1. **Goal:** Moving from step-by-step human instruction to a system capable of autonomously decomposing a single long-term goal and adapting sub-goals based on new information.
2. **Identity:** Evolving self-assessment updated by experience, rather than a static description in a system prompt.
3. **Decision Making:** Replacing textual Chain-of-Thought reasoning with "simulative reasoning" that uses a dedicated world model to predict real-world consequences before selecting actions.
4. **Cognitive Control:** Introducing a separate "System III" metacognitive module that dynamically decides when to deliberate, stick to a plan, or act quickly.
5. **Learning:** Enabling "continual autonomous learning," where the agent itself decides when to act, practice in simulation, or update its world model and self-perception.

The GIC architecture integrates six components to embody these principles: a belief encoder, goal decomposer, identity evolver, configurator (System III), simulation-based planner (System II), and executor (System I).

The paper argues that a growth path akin to pilot training (ground theory, simulator practice, real deployment) should be underpinned by a unified cognitive architecture, not separate workflows. On safety, the authors contend that the GIC framework's modular, explicit design enhances inspectability, allowing problematic behavior to be traced to specific components (e.g., a flawed goal or a poorly trained module) rather than emerging opaquely. However, they acknowledge that ultimate safety depends on correctly training these modules in the first place.

In conclusion, the paper challenges the loose application of the term "Agent," asserting that task completion alone does not equal true autonomy. True autonomy requires goals, identity, and judgment to be genuinely internalized within the agent's architecture, not merely enforced by external scripts.
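
To make the split between Systems I, II, and III more concrete, here is a minimal sketch covering only three of the six components named in the summary. It is not the paper's implementation: the belief dictionary, the uncertainty threshold, and all function names are hypothetical, and the "world model" is a one-line stand-in.

```python
from typing import Callable

# Hypothetical toy belief: {"position": int, "uncertainty": float in [0, 1]}
Belief = dict
WorldModel = Callable[[int, int], int]  # (position, action) -> predicted position


def executor(belief: Belief, goal: int) -> int:
    """System I stand-in: fast reflex that assumes +1 always moves toward larger positions."""
    pos = belief["position"]
    return (goal > pos) - (goal < pos)


def simulative_planner(belief: Belief, goal: int, world_model: WorldModel) -> int:
    """System II stand-in: simulate each action with the world model and keep the best one."""
    pos = belief["position"]
    return min((-1, 0, 1), key=lambda a: abs(goal - world_model(pos, a)))


def configurator(belief: Belief, threshold: float = 0.5) -> str:
    """System III stand-in: deliberate only when the belief is uncertain (assumed rule)."""
    return "deliberate" if belief["uncertainty"] > threshold else "react"


def act(belief: Belief, goal: int, world_model: WorldModel) -> tuple[str, int]:
    """Route the decision through System I or System II, as chosen by the configurator."""
    mode = configurator(belief)
    if mode == "deliberate":
        return mode, simulative_planner(belief, goal, world_model)
    return mode, executor(belief, goal)


def inverted_controls(position: int, action: int) -> int:
    """Toy environment model in which actions have the opposite of the expected effect."""
    return position - action
```

With `inverted_controls` as the world model, an uncertain agent at position 3 with goal 5 deliberates and picks `-1`, while a confident one reacts with `+1`. The example is contrived, but it shows where a simulated consequence can override a reflex.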

marsbit, 07/01 11:25

World Models, Metaverse, Digital Twins, Physical AI: Are They the Same Thing?

The article clarifies that concepts like the metaverse, Web3, simulation platforms, digital twins, and Physical AI are not the same thing but are all part of the broader trend of blurring the lines between the digital and physical worlds. It positions "world models" as the foundational "cognitive layer" or "operating system" that enables AI to understand and simulate the world. Key distinctions are made:

- The **Metaverse** is a destination for immersive social and economic experiences. World models could act as its "engine," generating interactive 3D content efficiently.
- **Web3** focuses on decentralized ownership and economics (the rules layer), operating on a different technical level than world models.
- **Simulation Data Platforms** (e.g., for autonomous vehicles) are a 1.0 version, relying on manual design. World models represent a 2.0 version, using AI to generate realistic, varied scenarios autonomously.
- **Digital Twins** create high-fidelity, real-time mirrors of physical systems (e.g., a factory). World models go a step further by enabling predictive simulation of future states.
- **Physical AI** (robots, AVs) refers to AI that acts in the physical world. World models are a core component, providing the understanding and prediction needed for planning.

A proposed hierarchy places world models at the cognitive layer. That layer rests on infrastructure (compute, data) and in turn supports application tools (simulation, digital twins), action systems (Physical AI), user experiences (the metaverse), and rules (Web3).

In conclusion, while distinct, many of these previously hyped concepts may ultimately rely on advances in world model technology to fulfill their promises, as world models provide the essential cognitive foundation for simulating and interacting with complex environments.

marsbit, 06/28 10:41

Fei-Fei Li's Manifesto for World Models

"Feifei Li's World Model Manifesto" draws a crucial distinction between current AI's linguistic prowess and its lack of understanding of the physical world. Citing Wittgenstein, Li argues that true intelligence requires moving beyond text statistics to comprehend physical laws like optics, inertia, and collision. The article diagnoses the current confusion around "world models" and proposes a clear taxonomy based on the Partially Observable Markov Decision Process (POMDP) framework. Li identifies three core, interdependent pillars for building such models: 1) The **Renderer**, which masters visual plausibility and pixel generation (e.g., Sora, image models) but lacks structural integrity. 2) The **Simulator**, which prioritizes strict adherence to physical laws (mass, friction, collision) and is essential for robotics and real-world application, though it is computationally demanding and data-hungry. 3) The **Planner**, which connects perception to action, enabling decision-making in complex, unstructured environments. Li posits the **Simulator as the critical nexus** linking rendering and planning, highlighting NVIDIA's Omniverse as a leading example. Mastering physical simulation is key to industrial AI applications. Despite challenges like scarce annotated 3D data and "physics-unrealistic" generative outputs, a convergent trend is emerging. The future lies in a **unified foundational model** that seamlessly integrates rendering, simulation, and planning into a dynamic, interactive system. Ultimately, this pursuit of "world models" represents the next evolutionary step for AI: developing **spatial intelligence** to interact with the physical world. It's not merely an algorithmic challenge but a redefinition of digital-physical standards on the path to AGI. However, as noted by Yann LeCun, achieving even rudimentary physical understanding akin to a dog's intelligence may still be years away.

marsbit, 06/09 00:37

From Code to Cognition: A Ten-Thousand-Word Guide to the Evolution of the Robot Brain

"From Code to Cognition: The Evolution of Robot Brains" The journey of robotic intelligence has shifted dramatically from manually coded systems to AI-driven brains. For decades, robots relied on layered software stacks—perception, state estimation, planning, control—each handcrafted. While predictable, they lacked adaptability. The 2010s saw deep learning revolutionize perception (e.g., object detection) and control (via reinforcement learning), but learned skills remained narrow. The arrival of Large Language Models (LLMs) marked a turning point. LLMs acted as high-level planners, interpreting natural language instructions and generating sequences of actions for traditional robotic systems to execute. However, true integration came with Visual-Language-Action (VLA) models, which fused vision, language, and motion prediction into a single network. Pioneered by models like RT-2 and open-source projects like OpenVLA, VLAs enable robots to reason and act directly from visual input and commands. The most advanced humanoid robots now employ a "dual-brain" architecture: a slow-thinking, large VLA (System 2) for reasoning and planning, and a fast-reacting, small network (System 1) for high-frequency motion control, sometimes with an even lower-level System 0 for balance. This split balances cognition with the physics of real-time movement. Computation is split between onboard hardware (e.g., NVIDIA Jetson) for safety-critical control loops and cloud/edge servers for non-critical tasks like learning and interfaces. A crucial driver is the open-source ecosystem—models like GR00T and OpenVLA allow startups to build upon pre-trained brains and fine-tune them with their own data, accelerating development. Despite progress, current systems struggle with recovery from errors, sample inefficiency, and long-horizon tasks. This has spurred the rise of **World Models**—neural networks that predict the consequences of actions. 
By simulating possible futures before acting (like NVIDIA Cosmos or Meta V-JEPA), robots can plan, recover, and generalize better. This represents the next frontier: shifting intelligence from learned reactions to an internal model of physics and cause-and-effect. The field is rapidly evolving. While not yet at its "ChatGPT moment," the convergence of cheaper hardware, scalable simulation, and world models points toward robots that are increasingly capable, adaptive, and useful. The question is shifting from "what can robots do?" to "what *should* they do?"
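
The idea of simulating possible futures before acting can be sketched as a brute-force search over short action sequences. This is a generic illustration, not how NVIDIA Cosmos or V-JEPA work: the `world_model` function below is a hand-written stand-in for a learned predictor, and the names and the 1-D state are assumptions.

```python
from itertools import product


def world_model(state: float, action: float) -> float:
    """Stand-in for a learned dynamics model: predicts the next state."""
    return state + action


def plan_by_simulation(
    state: float,
    goal: float,
    actions: tuple[float, ...] = (-1.0, 0.0, 1.0),
    horizon: int = 3,
) -> tuple[float, float]:
    """Imagine every action sequence of length `horizon`, score the predicted
    end state, and return (first action of the best sequence, its cost)."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        predicted = state
        for action in seq:  # roll the model forward without touching the real world
            predicted = world_model(predicted, action)
        cost = abs(goal - predicted)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0], best_cost
```

In practice the robot would execute only the returned first action, observe the result, and plan again. The exhaustive loop evaluates `len(actions) ** horizon` sequences, which hints at why long-horizon tasks remain difficult.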

marsbit, 06/07 12:55

Fei-Fei Li's Team Clarifies the Concept of 'World Models', Sora Merely a Renderer

"World Models" has become a widely used yet confusing term in AI. To address this, a team led by Fei-Fei Li and World Labs proposed a functional taxonomy based on the Partially Observable Markov Decision Process framework. This taxonomy categorizes systems called "world models" into three distinct projections: Renderers, Simulators, and Planners. Renderers, like OpenAI's Sora and other video generation models, focus on producing photorealistic visual outputs for human perception. They prioritize visual fidelity over physical accuracy. Simulators, such as NVIDIA Omniverse, aim to compute precise future environmental states for computational tasks like engineering analysis or digital twins. Planners, like Vision-Language-Action models, take in observations and goals to output executable actions for robots or agents. The article clarifies that most current "world models," including Sora, are primarily Renderers. They generate convincing visuals but lack the core ability to simulate state transitions based on actions, a key requirement for a true world model in classic reinforcement learning definitions. This conceptual confusion has practical implications, leading to potential misalignment in technology selection, investment, and public understanding of AI capabilities. Clear categorization is crucial. It helps enterprises avoid costly mistakes (e.g., using a renderer for robot training), allows investors to accurately assess markets, and enables researchers to build comparable benchmarks. While future systems may integrate these functions, recognizing current boundaries is essential for honest assessment and progress.

marsbit, 06/04 03:16

World Models Shift from Prediction to Planning: HWM and the Challenge of Long-Horizon Control

World models have evolved from focusing on representation learning and future prediction to addressing long-horizon planning challenges. While models like V-JEPA 2 demonstrate strong predictive capabilities using large-scale video pre-training, they struggle with multi-stage control tasks because prediction errors accumulate and the action search space grows exponentially with the planning horizon.

HWM (Hierarchical World Model) introduces a two-level planning structure: a high-level planner outlines coarse subgoals over longer time horizons, while a low-level executor handles short-term actions. This decomposition reduces planning complexity and error propagation. In experiments, HWM achieved 70% success in real-world robotic tasks where flat models failed entirely.

Complementary efforts include V-JEPA (focused on representation), HWM (on hierarchical planning), and WAV (World Action Verifier, on self-correction). Together, they mark a shift from pure world modeling to integrated systems capable of prediction, planning, and verification, which is key to deploying world models in real-world agents and long-term tasks.
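
The complexity argument for two-level planning can be illustrated with a toy search. The sketch below is not HWM's algorithm: the evenly spaced waypoints, the exhaustive low-level search, and all names are assumptions chosen only to compare how many action sequences a flat and a hierarchical planner would need to evaluate.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set on a 1-D line


def high_level_subgoals(start: int, goal: int, segments: int) -> list[int]:
    """Coarse plan: evenly spaced waypoints ending exactly at the goal."""
    return [start + (goal - start) * k // segments for k in range(1, segments + 1)]


def low_level_plan(state: int, subgoal: int, horizon: int) -> tuple[list[int], int]:
    """Exhaustive short-horizon search toward one subgoal.
    Returns the best action sequence and the number of sequences evaluated."""
    best = min(
        product(ACTIONS, repeat=horizon),
        key=lambda seq: abs(subgoal - (state + sum(seq))),
    )
    return list(best), len(ACTIONS) ** horizon


def hierarchical_plan(
    start: int, goal: int, segments: int = 3, horizon: int = 4
) -> tuple[list[int], int, int]:
    """Chain short low-level plans between high-level waypoints.
    Returns (all actions, final state, total sequences evaluated)."""
    state, actions, evaluated = start, [], 0
    for subgoal in high_level_subgoals(start, goal, segments):
        seq, count = low_level_plan(state, subgoal, horizon)
        actions += seq
        state += sum(seq)
        evaluated += count
    return actions, state, evaluated
```

For a 12-step task, `hierarchical_plan(0, 12)` evaluates 3 × 3⁴ = 243 short sequences, whereas a flat exhaustive search over the same horizon would face 3¹² = 531,441. Shorter rollouts also give model errors fewer steps in which to compound.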

marsbit, 04/17 10:26
