# Spatial Intelligence Related Articles

HTX News Center provides the latest articles and in-depth analysis on "Spatial Intelligence", covering market trends, project updates, tech developments, and regulatory policies in the crypto industry.

Li Fei-Fei's Latest Long-Form Article: When Video Generation, Robotics, and NVIDIA All Call Themselves World Models, We Need a Taxonomy

In a new article, Dr. Fei-Fei Li addresses the widespread and often inconsistent use of the term "world model" in AI. She proposes a functional taxonomy rooted in the classic Partially Observable Markov Decision Process (POMDP) loop (agent → action → state → observation → agent). Under this framework, the systems currently called "world models" are different projections of that loop, categorized by their primary output:

1. **Renderers**: Output observations (pixels). Their goal is visual fidelity for human consumption (e.g., video generation models like Sora). They are the most commercially mature but are limited by a focus on appearance over physical accuracy.
2. **Simulators**: Output states (geometric, physical, dynamic representations). They provide a structurally accurate world for both human professionals (e.g., architects) and computational agents (e.g., robots in training). Li argues simulators are the crucial, underappreciated bridge, as they can underpin both rendering and planning.
3. **Planners**: Output actions. Given an observation and a goal, they decide what an agent should do next (e.g., robotic action models). This area is highly promising but remains the least mature for real-world deployment.

Li highlights a key trend: the boundaries between these three categories are beginning to blur, as they all rely on a shared underlying understanding of geometry, physics, and dynamics. The logical endpoint is a unified world foundation model capable of switching between rendering, simulation, and planning based on downstream needs. This convergence, she concludes, is central to advancing spatial intelligence, enabling machines not just to talk about the world but to understand, imagine, and interact with it.
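The three roles can be made concrete with a toy version of the POMDP loop. The sketch below is purely illustrative and is not taken from Li's article: the 1-D grid world and the names `State`, `Simulator`, `Renderer`, `Planner`, and `run_loop` are all hypothetical, chosen only to show which component outputs states, which outputs observations, and which outputs actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hidden world state (hypothetical): agent position on a 1-D line."""
    position: int


class Simulator:
    """Outputs states: predicts the next state from a state and an action."""

    def step(self, state: State, action: int) -> State:
        return State(position=state.position + action)


class Renderer:
    """Outputs observations: a text 'image' standing in for pixels."""

    def __init__(self, width: int = 7) -> None:
        self.width = width

    def render(self, state: State) -> str:
        cells = ["."] * self.width
        cells[state.position] = "A"
        return "".join(cells)


class Planner:
    """Outputs actions: picks a move from an observation and a goal."""

    def act(self, observation: str, goal: int) -> int:
        # The planner never sees State directly, only the rendered observation.
        position = observation.index("A")
        if position < goal:
            return 1
        if position > goal:
            return -1
        return 0


def run_loop(start: int, goal: int, max_steps: int = 10) -> list[str]:
    """Run agent -> action -> state -> observation -> agent until the goal."""
    simulator, renderer, planner = Simulator(), Renderer(), Planner()
    state = State(position=start)
    frames = [renderer.render(state)]
    for _ in range(max_steps):
        action = planner.act(frames[-1], goal)
        if action == 0:  # goal reached, nothing left to do
            break
        state = simulator.step(state, action)
        frames.append(renderer.render(state))
    return frames


if __name__ == "__main__":
    for frame in run_loop(start=1, goal=4):
        print(frame)
```

In this toy setup the simulator is the only component that touches the state, which loosely mirrors the summary's point that a state-level model can feed both rendering and planning.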

marsbit · 7h ago


Li Fei-Fei's Latest Article: When Video Generation, Robotics, and NVIDIA All Claim to Have 'World Models,' We Need a Taxonomy

"World model" has become a widely used yet ambiguous term in AI. Drawing on the classic POMDP framework (agent → action → state → observation), this article proposes a functional taxonomy to clarify the concept. It identifies three distinct types, categorized by their output in the perception-action loop:

1. **Renderers**: Output visual observations (pixels). These models, like advanced video generators, prioritize visual fidelity but often lack underlying physical accuracy.
2. **Simulators**: Output the state of the world (geometry, physics, dynamics). They provide a structurally accurate representation for professionals (e.g., architects) and serve as training environments for robots and AI agents.
3. **Planners**: Output actions. Given an observation and a goal, they determine what an agent should do next, closing the perception-action loop (e.g., vision-language-action models).

While renderers are currently the most commercially mature and planners are the most aspirational, the article argues that **simulators are the crucial, underappreciated hub**. By working at the level of geometry and physics, a simulator can project upwards to create visuals for humans and downwards to predict action consequences for agents.

The future lies in the convergence of these three functions. Emerging research and products, like World Labs' Marble model, which outputs both visual splats and physical collision meshes, are beginning to blur these boundaries. The logical endpoint is a unified world foundation model capable of rendering, simulating, and planning based on a shared understanding of spatial and temporal structures, ultimately enabling machines to understand, imagine, and interact with the physical world.

ChainCatcher (链捕手) · 8h ago


Fei-Fei Li's Manifesto for World Models

"Fei-Fei Li's Manifesto for World Models" draws a crucial distinction between current AI's linguistic prowess and its lack of understanding of the physical world. Citing Wittgenstein, Li argues that true intelligence requires moving beyond text statistics to comprehend physical phenomena like optics, inertia, and collision. The article diagnoses the current confusion around "world models" and proposes a taxonomy based on the Partially Observable Markov Decision Process (POMDP) framework. Li identifies three core, interdependent pillars for building such models:

1. The **Renderer**, which masters visual plausibility and pixel generation (e.g., Sora, image models) but lacks structural integrity.
2. The **Simulator**, which prioritizes strict adherence to physical laws (mass, friction, collision) and is essential for robotics and real-world application, though it is computationally demanding and data-hungry.
3. The **Planner**, which connects perception to action, enabling decision-making in complex, unstructured environments.

Li posits the **Simulator as the critical nexus** linking rendering and planning, highlighting NVIDIA's Omniverse as a leading example. Mastering physical simulation is key to industrial AI applications. Despite challenges like scarce annotated 3D data and "physics-unrealistic" generative outputs, a convergent trend is emerging. The future lies in a **unified foundational model** that integrates rendering, simulation, and planning into a dynamic, interactive system.

Ultimately, this pursuit of "world models" represents the next evolutionary step for AI: developing **spatial intelligence** to interact with the physical world. It is not merely an algorithmic challenge but a redefinition of digital-physical standards on the path to AGI. However, as noted by Yann LeCun, achieving even rudimentary physical understanding akin to a dog's intelligence may still be years away.

marsbit · 06/09 00:37


From a Lunch Table to an Infinite Universe: Fei-Fei Li Bets on AI's Next Dimension

In an era dominated by large language models, AI pioneer Fei-Fei Li argues that true understanding requires spatial intelligence: the ability to perceive, reason, and interact within the physical 3D/4D world. She points to evolutionary history: spatial perception drove the Cambrian explosion 540 million years ago, while language is a far more recent, inherently "lossy" way to encode reality. Current models struggle with basic spatial tasks a child can do, like counting chairs in a video.

Her company, World Labs, is pioneering this shift with "Marble," a model that generates navigable, consistent 3D worlds from text, images, or simple 3D inputs, which sets it apart from video generators like Sora. Though smaller than models like GPT-5, due to scarce 3D data and early-stage scaling laws, Marble is already used in gaming, robot training (by NVIDIA), architectural design, and personalized therapy for conditions like OCD and acrophobia.

Li envisions this technology enabling "infinite universes" for creativity, social interaction, and more. However, she cautions against utopian or dystopian extremes, advocating for a measured vision where AI enhances human dignity and prosperity, akin to how electricity transformed civilization. The journey is long, as evidenced by the 20-year path to viable autonomous vehicles, but the direction is clear: AI must move from merely talking about the world to understanding and acting within it.

marsbit · 05/27 00:14


Understanding Jensen Huang's Physical AI: Why Is Crypto's Opportunity Also Hidden in the 'Nooks and Crannies'?

Jensen Huang's recent speech at Davos signals a pivotal shift in AI: the transition from the training-focused "brute force" era of AI 1.0 to the new paradigm of "Physical AI" and inference. This marks the next phase after Generative AI, focusing on real-world application and embodiment.

Physical AI aims to solve the "last-mile" problem of AI: moving from digital intelligence to physical action. While LLMs have consumed vast digital data, they lack understanding of the physical world, such as how to twist open a bottle cap. Physical AI requires three core capabilities:

1. Spatial Intelligence: AI must perceive and interpret 3D environments in real time, understanding object properties, depth, and interaction dynamics.
2. Virtual Training Grounds: Systems like NVIDIA's Omniverse enable simulation-to-real (Sim-to-Real) training, allowing robots to learn through vast virtual iterations without costly physical failures.
3. Electronic Skin and Touch Data: Sensors that capture tactile feedback (temperature, pressure, texture) are critical. This data is a new, untapped asset class.

This shift opens significant opportunities for Crypto and Web3 ecosystems. DePIN networks can crowdsource hyperlocal spatial data from "every corner" of the world through token incentives. Distributed computing networks can provide edge-based rendering and inference power for low-latency physical responses. Tokenized data ownership and privacy-preserving sharing mechanisms can enable the scalable, ethical collection of sensitive tactile data. In short, Physical AI isn't just the next chapter for Web2; it's a catalyst for Web3 domains like DePIN, DeData, and decentralized AI.

marsbit · 01/23 00:35
