Li Fei-Fei's Latest Long-Form Article: When Video Generation, Robotics, and NVIDIA All Call Themselves World Models, We Need a Taxonomy

Published by marsbit on 2026-07-05; updated 2026-07-05

Article Summary

In a new article, Dr. Fei-Fei Li addresses the widespread and often inconsistent use of the term "world model" in AI. She proposes a clear, functional taxonomy rooted in the classic Partially Observable Markov Decision Process (POMDP) loop (agent → action → state → observation → agent). According to this framework, current systems called "world models" are different projections of this loop, categorized by their primary output:

1. **Renderers**: Output observations (pixels). Their goal is visual fidelity for human consumption (e.g., video generation models like Sora). They are the most commercially mature but are limited by a focus on appearance over physical accuracy.
2. **Simulators**: Output states (geometric, physical, dynamic representations). They provide a structurally accurate world for both human professionals (e.g., architects) and computational agents (e.g., robots for training). Li argues simulators are the crucial, underappreciated bridge, as they can underpin both rendering and planning.
3. **Planners**: Output actions. Given an observation and a goal, they decide what an agent should do next (e.g., robotic action models). This area is highly promising but remains the least mature for real-world deployment.

Li highlights a key trend: the boundaries between these three categories are beginning to blur, as they all rely on a shared underlying understanding of geometry, physics, and dynamics. The logical endpoint is a unified world foundation model capable of ...

Author: Li Fei-Fei

Compiler: Jiayang

"World model" is probably the hottest and most chaotic concept in the AI field since 2025. When Sora was released, OpenAI called it a world simulator; Genie lets you walk around in generated scenes and also calls it a world model; robotics companies say they are working on world models, NVIDIA says Omniverse is the infrastructure for world models, and even game engines have been pulled into this narrative. Everyone is using the same term, but they are all talking about completely different things.

Today, Li Fei-Fei published a new article on her personal Substack to clarify the concept. She first returns to the most classic diagram from reinforcement learning textbooks, the POMDP closed loop (agent→action→state→observation→agent), and then points out that what is now called a "world model" is actually three different projections of this closed loop. A system that outputs pixels (observations) is a renderer, one that outputs states is a simulator, and one that outputs actions is a planner. The classification criterion is simple: it depends on which part of the closed loop the system outputs.

(Source: MIT Technology Review)

She judges that among the three, the renderer is the most commercially mature but has a ceiling (looking good does not equal being physically correct), the planner is the most exciting but farthest from real-world deployment (the gap between lab demos and practical usability is still huge), and the simulator is the critically underestimated hub. Because the simulator works at the level of geometry, physics, and dynamics, it can project upward into pixels for human consumption, and also derive action consequences for robot use. Mastering simulation gives you the foundation for both rendering and planning; the reverse is not true.

This article is of course also a product manifesto for World Labs. Their Marble already outputs both Gaussian splats and collision meshes, attempting to unify the renderer and simulator into a single model. The endgame depicted at the end of the article is a unified world foundation model, capable of freely switching between rendering, simulation, and planning based on downstream needs. Whether this vision can be realized is another matter, but as an analytical framework, the tripartite classification of renderer/simulator/planner may indeed help cut through some of the noise surrounding the current concept of "world model."

The full text is translated below.

"The world is everything that is the case." — Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1921

The world is not made of words.

In an earlier article, we proposed that spatial intelligence is the next frontier of AI, and world models are the path to it. Here, the World Labs team and I want to go one level deeper: among the many things now labeled "world models," which functional components truly constitute this capability? And what are their respective uses?

Language models have endowed machines with powerful control over concepts, vocabulary, and reasoning, but the physical world, whether virtual or real, operates on a completely different substrate. Language models learn the statistical structure of text; world models learn the statistical structure of space and time: how light falls on a surface, what a garden looks like from an angle never captured by a camera, how objects respond to forces and follow physical laws.

This makes "world model" one of the most important, and also most abused, terms in AI today. Computer vision, robotics, reinforcement learning, and generative AI all claim to be building world models, but each refers to something quite different. A video model that generates gorgeous but physically impossible flames, a language model that improvises playable games, a physics engine that faithfully simulates combustion—they are all called by the same name.

The ancient Greeks could never agree on what the world was made of, be it fire, water, or indivisible atoms, because "the world" was never a single thing. It was always a stand-in a thinker used to reason about a certain totality. AI has inherited the same problem, and it happens precisely at the moment when the field most needs precision.

The Closed Loop Behind the Taxonomy

To sort out this confusion, we can start with a diagram older than all the technologies mentioned above. All reinforcement learning textbooks, including the classic Sutton and Barto, have for decades used variants of the same diagram to describe how an agent interacts with the world. The formal name of this diagram is a partially observable Markov decision process (POMDP), and the initial definition of the term "world model" belongs to this tradition.

An agent (which can be a human, robot, or software system) performs actions. These actions change the state of the world. But the agent can never directly see the state itself; what it receives are observations: photons hitting the retina, sensor readings, pixels in a video frame. New observations guide new actions, and the cycle repeats.

The word "state" needs to be unpacked because its meaning shifts across fields. This is not the chemist's state, the distinction between solid, liquid, and gas. It is the physicist's and roboticist's state: a complete description of everything that is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world, complete in principle, but never directly observable by any agent within it. Observations are the agent's partial perspective on this reality. Actions are the agent's response to those observations.

This closed loop (agent→action→state→observation→agent) is precisely the structure that gives the term "world model" its technical meaning. The phrase itself is older, dating back to Kenneth Craik's 1943 proposal that the mind reasons by running a "small-scale model" of reality, and was introduced into the neural network field in the late 1980s and early 1990s. This closed loop also explains what people mean when they use the term today. The various things now called world models are actually different projections of the same closed loop, each outputting a different component of the loop.
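The closed loop can be made concrete with a small sketch. The toy world below is an assumption made for illustration and is not from the article: the state is a position and a velocity on a line, the observation exposes only the position, and the agent (the hypothetical `policy` function) pushes toward a goal using what it can see.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Full description of the toy world. Hidden from the agent."""
    position: int
    velocity: int


def step(state: State, action: int) -> State:
    """World dynamics: the action changes velocity, velocity changes position."""
    velocity = state.velocity + action
    return State(position=state.position + velocity, velocity=velocity)


def observe(state: State) -> int:
    """Partial observation: the agent sees position only, never velocity."""
    return state.position


def policy(observation: int, goal: int) -> int:
    """A deliberately naive agent: push toward the goal, wait when on it."""
    if observation < goal:
        return 1
    if observation > goal:
        return -1
    return 0


def run_loop(initial: State, goal: int, steps: int) -> list[int]:
    """Run agent -> action -> state -> observation -> agent for `steps` rounds."""
    state = initial
    observations = [observe(state)]
    for _ in range(steps):
        action = policy(observations[-1], goal)   # agent acts on what it sees
        state = step(state, action)               # action changes the hidden state
        observations.append(observe(state))       # world emits a new observation
    return observations


print(run_loop(State(position=0, velocity=0), goal=3, steps=4))  # [0, 1, 3, 5, 6]
```

In this sketch the agent reaches the goal at the third observation and then drifts past it, because the velocity it built up is part of the state but not of the observation. That is the gap between state and observation the loop describes.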

The Three Functions of World Models

The first type of world model is the renderer. The renderer outputs observations, specifically pixels for human eyes, and the most important quality metric is visual fidelity. A video model that turns text prompts into cinematic aerial shots is a renderer; interactive systems like Google's Genie 3 or World Labs' own RTFM are also renderers, generating imagery in real-time based on user input. Such models lack explicit understanding of 3D structure. They generate what a viewer would see, not what things actually are. The buildings in an aerial shot might look flawless from above, but try to navigate the city below and they fall apart.

The second is the simulator. The simulator outputs states: a representation of the world that is faithful in geometry, physics, or dynamics, on which both humans and computer programs can compute and interact. The renderer's contract is purely visual, while the simulator's contract is structural—it requires geometry that holds up to inspection, physics that follows Newton's laws, and dynamics that behave as expected by physical principles. The simulator serves two types of users. Professionals like architects, designers, filmmakers, and game developers need accuracy beyond visual plausibility. Computer programs like reinforcement learning agents, robot controllers, and autonomous vehicles use the simulator as a training ground to interact with the world at scale, testing scenarios that would be dangerous, expensive, or impossible to execute in reality.

The third is the planner. The planner outputs actions. Given an observation and a goal, the planner answers the question: what should the agent do next? In many ways, the planner is the inverse of the renderer. The renderer takes actions as input and produces observations; the planner takes observations as input and produces actions, thereby closing the perception-action loop. Vision-Language-Action (VLA) models, model-based systems, and the new wave of World Action Models are all attempts at planning: enabling systems to decide what a robot should do in an unstructured world.
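One way to read the three definitions is as three function signatures that differ only in what they return. The sketch below is illustrative only: the type aliases and the `toy_*` functions are hypothetical stand-ins (a one-line string for pixels, a tuple for geometry, an integer for a motor command), not anything described in the article.

```python
from typing import Callable, Tuple

# Hypothetical stand-in types; real systems use images, meshes, and motor commands.
Observation = str          # a one-line "frame" with the cup drawn as "C"
State = Tuple[int, int]    # (cup_cell, table_cells)
Action = int               # -1 = push left, 0 = wait, +1 = push right

Renderer = Callable[[Observation, Action], Observation]  # outputs observations
Simulator = Callable[[State, Action], State]             # outputs states
Planner = Callable[[Observation, int], Action]           # outputs actions


def toy_renderer(frame: Observation, action: Action) -> Observation:
    """Shift the cup glyph in the frame. It has no notion of table edges,
    so the cup wraps around: plausible-looking, physically wrong."""
    cup = frame.index("C")
    new_cup = (cup + action) % len(frame)
    return "".join("C" if i == new_cup else "_" for i in range(len(frame)))


def toy_simulator(state: State, action: Action) -> State:
    """Move the cup, but clamp it at the table edges: structure is respected."""
    cup, cells = state
    return (min(max(cup + action, 0), cells - 1), cells)


def toy_planner(frame: Observation, goal_cell: int) -> Action:
    """Read the frame and return the push direction that approaches the goal."""
    cup = frame.index("C")
    return int(goal_cell > cup) - int(goal_cell < cup)


print(toy_renderer("__C", 1))       # C__
print(toy_simulator((2, 3), 1))     # (2, 3)
print(toy_planner("C__", 2))        # 1
```

The same push at the right-hand edge gives a frame that looks fine from the renderer and a state that stays on the table from the simulator, which is a miniature version of the contrast between a visual contract and a structural one.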

These three categories cover most of the work currently being implemented, and distinguishing between them is useful in practice. But these categories are not fundamentally separate. They share the same underlying knowledge about how the world works: geometry, physics, dynamics. A model that can render a cup from any angle should, in principle, also be able to simulate what happens if the cup is pushed, and plan how a hand should pick it up. Increasingly, the most interesting research is intentionally blurring the boundaries between these three.

Image: The Three Types of World Models (Source: Substack)

Why Simulation is the Critical Hub

Among the three categories, the simulator receives the least public attention but is the most important. This article aims to correct that asymmetry.

Renderers are currently the most commercialized. A multitude of image- or text-to-video products are expanding rapidly in both consumer and enterprise markets. Google's Nano Banana model delivers renderer-level image generation capabilities to possibly hundreds of millions of users. The technology is real, and the market is real. However, the target of renderer optimization is visual plausibility, not physical accuracy, and that ceiling is important. Their outputs are beautiful, but you cannot use them to design a building or train a robot.

Planners are the most exciting and least mature, closely related to the rapidly evolving field of robot learning. Over the past two years, this field has produced many robot demonstrations that look impressive in videos, but we need to be honest about what these demos actually show. Almost all are confined to highly constrained lab environments, with limited object variety and short task durations. None have been validated against the complexity, diversity, and duration required for real-world deployment. The gap between a stunning demo video and a robot that works reliably in a kitchen, warehouse, or operating room remains vast.

Despite this, the scale of commercial bets is substantial. A wave of well-funded newcomers is racing to launch general-purpose planning systems, while large infrastructure players are building planning capabilities on top of broader simulation stacks.

Simulation is the bridge connecting the two. If language is an abstraction of the world, and pixels are a projection of the world, then geometry, physics, and dynamics are the world itself. The simulator must operate at this level: it is the structural skeleton from which visual appearance (for renderers) and action consequences (for planners) can both be derived.

A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents. A model that masters only rendering or only planning cannot do the reverse. The commercial space here is immense. NVIDIA's Omniverse alone targets a total addressable market estimated by the company at over a trillion dollars, covering factories, warehouses, supply chains, and digital twins. Robot training, autonomous driving testing, architectural visualization, engineering design, drug discovery: all rely on some form of simulation.
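The claim that rendering and planning can both be derived from simulation can be sketched in a few lines. Everything below is an assumption made for illustration: a one-dimensional world whose state is (position, velocity), a hypothetical `project_to_pixels` that draws the state as text, and a hypothetical `plan_with_simulator` that does one-step lookahead. It is not a description of Marble or of any system named in the article.

```python
from typing import Sequence, Tuple

State = Tuple[int, int]   # (position, velocity) in a hypothetical 1D world
Action = int              # change in velocity: -1, 0, or +1


def simulate(state: State, action: Action) -> State:
    """The simulator: advance the full state by one step."""
    position, velocity = state
    velocity += action
    return (position + velocity, velocity)


def project_to_pixels(state: State, width: int = 8) -> str:
    """Rendering as a projection of state: draw position, discard velocity."""
    position, _ = state
    return "".join("#" if i == position else "." for i in range(width))


def plan_with_simulator(state: State, goal: int,
                        actions: Sequence[Action] = (-1, 0, 1)) -> Action:
    """Planning on top of the simulator: try each action in simulation and
    keep the one whose predicted position lands closest to the goal."""
    return min(actions, key=lambda a: abs(simulate(state, a)[0] - goal))


state = (3, 2)                                  # at cell 3, already moving right
print(project_to_pixels(simulate(state, 0)))    # .....#..
print(plan_with_simulator(state, goal=5))       # 0
```

Because the planner queries the simulator, it accounts for the existing velocity and chooses to wait rather than push. Going the other way is not possible in this sketch: the velocity cannot be recovered from a single rendered frame.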

The most difficult open questions in the field are also concentrated here. 3D data with explicit geometry, material properties, and physics annotations is several orders of magnitude scarcer than the internet videos used to train renderers. The sim-to-real gap (the difference between how objects behave in simulation and in the real world) persists. Generative simulators add a new risk: AI-generated geometry may look correct but actually contain self-intersections or incorrect scales, causing physics simulation to produce absurd results. The computational cost of large-scale multi-physics simulation (rigid bodies, deformable objects, fluids, cloth all interacting simultaneously) is still several orders of magnitude higher than single-domain simulation.

At World Labs, Marble is our first step in this direction. It takes multimodal input (text, images, video, or spatial sketches), generates explorable 3D environments, and simultaneously outputs Gaussian splats for visual exploration and collision meshes for physics engine operation. But Marble is only the first chapter of a long arc. This story is being written across the entire field as the boundaries between rendering, simulation, and planning begin to dissolve.

Boundaries are Dissolving, and What Comes Next

The most important trend in the field right now is that the three categories are beginning to merge. The underlying consensus is that the knowledge needed to render a world, simulate it, and act within it is largely the same. To return to the earlier example, a model that truly understands how a cup sits on a table (its geometry, material properties, response to force, and so on) should be able to render that cup from any angle, simulate what happens if the cup is pushed, and plan how a hand should pick it up. The three categories are three projections of the same underlying understanding.

For instance, a small but growing body of work from various robotics labs has recently shown a possibility that is at least conceptually viable: a pre-trained video renderer can serve as a backbone network for joint world prediction and action prediction, allowing a single model to simultaneously imagine "what will happen" and "what to do," thus bridging renderer and planner. World Labs' Marble already outputs both Gaussian splats and collision meshes from a single model, dissolving the boundary between renderer and simulator. At every level, there is a shift from passive output to interactive systems: renderers become responsive to action conditions, simulators generate worlds that are more controllable and editable, and planners begin to engage in deliberative reasoning rather than just reacting.

The logical endpoint is a unified world model: a foundation model capable of rendering photorealistic views, generating physically accurate structures, planning action sequences, and switching between different output modalities based on downstream user needs. We will still face a series of daunting challenges. The data landscape is highly uneven, with renderers sitting atop vast amounts of internet video, while simulators and planners suffer from severe shortages of 3D assets and robot demonstration data. Optimization for visual aesthetics may sacrifice the precision needed for robotics or high-fidelity simulation. Reconciling these tensions within a single architecture is the central open problem in world model research today, and is what World Labs is committed to solving as Marble continues to evolve.

(Source: Substack)

But the general direction is already clear. From the late 1980s to today, the field has kept making the same bet: if the world model is rich enough, everything an agent needs to see the world, construct it, and act within it will be inside. This bet is now driving a generation of research. What truly adds weight to it is the fusion already under way: rendering, simulation, and planning, which began as independent research directions and each already support industries worth tens of billions of dollars, are now converging. When the boundaries disappear, their confluence will redefine something larger: the relationship between machine intelligence and the physical world it inhabits, which is the long-term trajectory of spatial intelligence.

Language gave machines a way to talk about the world. World models are how machines will ultimately understand, imagine, reason about, and interact with it.

Reference: 1. https://drfeifei.substack.com/p/a-functional-taxonomy-of-world-models
