Fei-Fei Li's Manifesto for World Models

marsbit | Published on 2026-06-09 | Last updated on 2026-06-09

Abstract

"Feifei Li's World Model Manifesto" draws a crucial distinction between current AI's linguistic prowess and its lack of understanding of the physical world. Citing Wittgenstein, Li argues that true intelligence requires moving beyond text statistics to comprehend physical laws like optics, inertia, and collision. The article diagnoses the current confusion around "world models" and proposes a clear taxonomy based on the Partially Observable Markov Decision Process (POMDP) framework. Li identifies three core, interdependent pillars for building such models: 1) The **Renderer**, which masters visual plausibility and pixel generation (e.g., Sora, image models) but lacks structural integrity. 2) The **Simulator**, which prioritizes strict adherence to physical laws (mass, friction, collision) and is essential for robotics and real-world application, though it is computationally demanding and data-hungry. 3) The **Planner**, which connects perception to action, enabling decision-making in complex, unstructured environments. Li posits the **Simulator as the critical nexus** linking rendering and planning, highlighting NVIDIA's Omniverse as a leading example. Mastering physical simulation is key to industrial AI applications. Despite challenges like scarce annotated 3D data and "physics-unrealistic" generative outputs, a convergent trend is emerging. The future lies in a **unified foundational model** that seamlessly integrates rendering, simulation, and planning into a dynamic, i...

"The world is everything that is the case."

In 1921, Ludwig Wittgenstein wrote this famous sentence in *Tractatus Logico-Philosophicus*. A century later, it is quoted by AI pioneer Fei-Fei Li as the opening of her latest technical blog post.

Over the past three years, people have grown accustomed to AI's disruptive impact on language. It began with ChatGPT, which gave machines abilities in expression, programming, and reasoning that on some tasks match or exceed those of humans.

However, behind this digital miracle lies a blind spot that is often overlooked: machines can talk about the world, yet remain ignorant of its physical essence. The blog post released by Fei-Fei Li serves as a sobering reality check.

Today, as generative AI has become an indispensable tool globally, the industry's own definition of "world models" is growing increasingly chaotic. Whether in video generation or embodied intelligence, companies are vying for the authority to define the concept.

After Fei-Fei Li published this blog post, many believed she was attempting to reclaim the definition of "world models." But on the contrary, I think what Fei-Fei Li truly aims to do is to issue a declaration: The world is not constituted by language, but by the rigorous laws of physical space and time.

For machines to truly step into the human physical world, they must break free from the comfort zone of text statistics and instead understand the refraction of light, the inertia of objects, and the logic of collisions. This is not only a paradigm shift in technology but also a necessary path for AI's advancement toward embodied intelligence.

01 We Need a Taxonomy

It must be admitted that in the AI lexicon, "world model" has devolved into a catch-all term; any project involving image generation or environment simulation seems capable of being linked to it. This ambiguity stems precisely from the multi-dimensional human need to define the "world."

When a technology is just starting out, there naturally won't be unified doctrines to confine it within clear boundaries. This chaos in defining "world models" is not uncommon in history. When ancient Greek philosophers debated whether the essence of the world was water, fire, or indivisible atoms, they were essentially searching for a cornerstone for their reasoning.

The AI field now faces a similar problem: when a video generation model produces visuals that are extremely realistic yet physically impossible, how should we define it? Fei-Fei Li's blog points to a long-established and robust foundational definition: the Partially Observable Markov Decision Process (POMDP).

This is also the formal framework underlying much of reinforcement learning, and it describes the closed loop of interaction between an agent and the physical world: the agent takes an Action, which changes the world's State. However, the agent lacks a god's-eye view and can only construct a partial picture of reality through Observation.

Essentially, a world model is the abstract model of the world that a machine builds in its "brain" to survive within this closed loop. If any part of this loop is not clearly defined, then the so-called world model remains merely a blind stacking of pixels.
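The closed loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything taken from Fei-Fei Li's post: the `TinyWorld` class, its one-dimensional track, the noise level, and the greedy policy in `run_episode` are all hypothetical and chosen to keep the example small.

```python
import random
from dataclasses import dataclass


@dataclass
class HiddenState:
    """True world state; the agent never reads this directly."""
    position: int  # cell index on a 1-D track


class TinyWorld:
    """Hypothetical 1-D world used only to illustrate the POMDP loop."""

    def __init__(self, size=10, noise=0.2, seed=0):
        self.size = size
        self.noise = noise
        self.rng = random.Random(seed)
        self.state = HiddenState(position=0)

    def step(self, action):
        """Apply an action (-1 or +1) and return a possibly noisy observation."""
        new_pos = self.state.position + action
        # Clamp so the hidden state stays inside the track.
        self.state.position = max(0, min(self.size - 1, new_pos))
        return self.observe()

    def observe(self):
        """Partial observability: the sensor is sometimes off by one cell."""
        if self.rng.random() < self.noise:
            return self.state.position + self.rng.choice([-1, 1])
        return self.state.position


def run_episode(world, goal, max_steps=50):
    """Action -> state change -> observation, repeated until the goal is observed."""
    obs = world.observe()
    history = [obs]
    for _ in range(max_steps):
        if obs == goal:
            break
        action = 1 if obs < goal else -1  # greedy policy on the observation
        obs = world.step(action)
        history.append(obs)
    return history
```

Note that the policy only ever sees `obs`, never `world.state`. With `noise` above zero, the agent may stop because it observed the goal while the hidden state is actually one cell away, which is exactly the gap between Observation and State that the POMDP framing makes explicit.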

02 The Three Pillars of Building Intelligence

This loop sounds simple, with each component's function easily understood. However, upon careful analysis, each contains countless details with blurred definitions. To explain the chaos within, Fei-Fei Li deconstructs world models into three core components. They serve both as a technical taxonomy and as the three pillars for AI's journey toward embodied intelligence.

1. Renderer

The core logic of the renderer is visual plausibility. Its output is pixels, striving to make the imagery appear natural, coherent, and aesthetically pleasing to the human eye.

This is currently the most mature field commercially. Models we are familiar with, such as OpenAI's Sora and ByteDance's Seedance 2.0 for video generation, and OpenAI's GPT-image-2 and Google's Nano Banana 2 for image generation, are essentially the most sophisticated visual probability machines available. By learning from billions of internet images and videos, they have ultimately mastered the distribution patterns of light, shadow, and form.

This seemingly beautiful reality comes at a cost, as Fei-Fei Li points out. These top models can generate magnificent architecture, but there is no load-bearing structure behind the pixels; if one could step inside and interact with the building, it would likely collapse at once. In other words, they don't understand what "support" is; they generate only what the viewer "sees," not what the world "is."

2. Simulator

What the simulator pursues is precisely the structural fidelity that the renderer lacks. It doesn't care at all whether a video looks good; its sole concern is whether the world follows physical laws. When a simulator outputs a mundane cup, it must include the cup's mass distribution, material friction coefficient, gravity response, and physical boundaries during collisions.
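To make the cup example concrete, here is a minimal sketch of what "structural" output might look like in code. The `RigidBody` fields and the `step` function are hypothetical names for illustration, and the physics is deliberately reduced to one dimension: gravity plus a floor. A real simulator would handle full 3-D contact, friction forces, and rotation.

```python
from dataclasses import dataclass

GRAVITY = 9.81  # m/s^2, standard gravity


@dataclass
class RigidBody:
    """Hypothetical physical description of an object such as a cup."""
    mass: float         # kg (free fall does not depend on it; kept for completeness)
    friction: float     # dimensionless coefficient (stored, unused in this 1-D sketch)
    half_height: float  # m, distance from the center to the bottom face
    z: float            # m, height of the center above the floor
    vz: float = 0.0     # m/s, vertical velocity


def step(body, dt):
    """Advance one time step with semi-implicit Euler and a floor at z = 0."""
    body.vz -= GRAVITY * dt
    body.z += body.vz * dt
    # Collision boundary: the bottom face may not pass below the floor.
    if body.z - body.half_height < 0.0:
        body.z = body.half_height
        body.vz = 0.0  # perfectly inelastic contact, an assumption of this sketch
    return body


cup = RigidBody(mass=0.3, friction=0.4, half_height=0.05, z=1.0)
for _ in range(1000):  # one simulated second at dt = 1 ms
    step(cup, 0.001)
```

After the loop the cup rests on the floor with its center at `half_height`, rather than sinking through it. A renderer has no obligation to enforce that constraint; a simulator does.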

With a simulator, the content in videos gains a claim to realism. However, simulators are not only severely underestimated but often outright ignored in the current AI wave.

From the case of the cup above, the existence of a simulator transforms "discussing art" into "studying physics." Constructing a simulator that strictly adheres to physical laws requires unimaginable computational resources and annotation costs. But for robots, visual aesthetics are almost a useless attribute; physical precision determines everything.

If a simulator isn't accurate enough, robots trained within it can never enter the real world. The Sim-to-Real gap is real: actions that pass 100% of tests in the lab can be completely thrown off by minute differences in friction in the real world. It echoes what is often called Moravec's paradox: the perception and motor skills that come effortlessly to humans are among the hardest for machines.
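A back-of-the-envelope calculation shows how sensitive outcomes can be to friction. For a block sliding on a flat surface, the standard kinetic-friction model gives a stopping distance of d = v0^2 / (2 * mu * g). The coefficients below are made-up values for illustration, not measurements from any real robot or simulator.

```python
GRAVITY = 9.81  # m/s^2


def sliding_distance(v0, mu):
    """Distance (m) a block slides on a flat surface before kinetic friction stops it."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return v0 ** 2 / (2.0 * mu * GRAVITY)


# Hypothetical coefficients: what the simulator assumed vs. the real surface.
sim_distance = sliding_distance(v0=1.0, mu=0.30)
real_distance = sliding_distance(v0=1.0, mu=0.35)
gap_cm = (sim_distance - real_distance) * 100.0
print(f"sim {sim_distance:.3f} m, real {real_distance:.3f} m, gap {gap_cm:.1f} cm")
```

With these numbers the object stops roughly 2.4 cm short of where the simulator predicted. For a gripper expecting the object at a specific spot, an error of that size can be enough to miss the grasp.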

3. Planner

The planner is responsible for action output. As the connection point between perception and feedback, it must answer a core question that has no standard answer: "What should be done next?" In Fei-Fei Li's framework, this is the final component of the entire "perception-action" closed loop and also its most challenging frontier.

All current Vision-Language-Action (VLA) models are attempting to enable systems to make decisions in unstructured, complex worlds. The planner doesn't merely predict the future; it chooses, from countless possibilities, the path most likely to achieve the goal. It is the key for machines to evolve from "observers" into "practitioners."
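One simple way to see the idea of "choosing from countless possibilities" is random-shooting planning: sample many candidate action sequences, roll each one forward through a model of the world, and keep the best. The sketch below is a toy under stated assumptions, not how VLA models work internally. The point-mass `simulate` function, the cost weights, and all parameter values are hypothetical.

```python
import random


def simulate(position, velocity, actions, dt=0.1):
    """Hypothetical forward model: a point mass on a line driven by accelerations."""
    for a in actions:
        velocity += a * dt
        position += velocity * dt
    return position, velocity


def plan(position, velocity, goal, horizon=10, candidates=200, seed=0):
    """Random-shooting planner: sample action sequences and keep the best rollout."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        final_pos, final_vel = simulate(position, velocity, actions)
        # Cost: distance to the goal plus a small penalty for arriving fast.
        cost = abs(final_pos - goal) + 0.1 * abs(final_vel)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


actions, cost = plan(position=0.0, velocity=0.0, goal=0.2)
```

The planner is only as good as the model it rolls out in. If `simulate` misjudges the dynamics, the "best" sequence is best only inside the model, which is why the quality of the simulator bounds the quality of the plan.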

03 The Hundred-Billion-Dollar Hub

Among the three categories Fei-Fei Li outlines, models corresponding to the renderer and planner are relatively common; the remaining simulator has logically become the most difficult component to realize. Fei-Fei Li also offers an insightful judgment: The simulator is the link connecting rendering and planning, and the core hub of the entire system.

The company that stands out most in the field of simulators is not OpenAI, Anthropic, or Google, but Jensen Huang's NVIDIA.

NVIDIA's Omniverse claims to support trillion-dollar digital twin dreams precisely because it grasps the essence of the simulator. On NVIDIA's platform, the operations of factories, supply chains, and warehouses have all become complete digital mirrors. For the industrial world, this is no longer a visual demo but a core infrastructure for productivity.

This is not an exaggeration but a trillion-dollar market opportunity visible to all.

The opportunities range from virtual visualization in architectural engineering to molecular dynamics simulations in the pharmaceutical industry and scenario testing for autonomous driving. What these industries lack is not vivid image or video generation models, but a high-fidelity simulator. It's no exaggeration to say that mastering the ability to simulate the physical world equates to holding a priority ticket for AI industrialization.

But the difficulties in reality leave this field with almost no technological optimists. Fei-Fei Li also admits that a huge gap persists.

First is the issue of embodied intelligence data, which we have repeatedly mentioned before. Video data on the internet is abundant, but 3D data with explicit geometric structure, material properties, and physical feedback annotations is extremely scarce.

Second, the application of generative AI will always be accompanied by hidden risks. AI-generated geometric models can at best achieve visual perfection but are often physically unreasonable—like cups intersecting with tabletops, or objects colliding and losing volume. In human terms, the brief phrase "clipping through" can summarize these bizarre phenomena, but in real industrial applications, this spells disaster.
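The "clipping through" failure is easy to state as a check. A minimal sketch, assuming objects are approximated by axis-aligned bounding boxes (a common simplification; real pipelines typically test the actual meshes): two boxes interpenetrate if their extents overlap on all three axes. The `Box` type and the table and cup dimensions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box given by its min and max corners (x, y, z) in meters."""
    lo: tuple
    hi: tuple


def penetration_depth(a, b):
    """Smallest overlap along any axis, or 0.0 if the boxes are disjoint or only touching."""
    overlaps = []
    for axis in range(3):
        overlap = min(a.hi[axis], b.hi[axis]) - max(a.lo[axis], b.lo[axis])
        if overlap <= 0.0:
            return 0.0  # separated (or merely in contact) on this axis
        overlaps.append(overlap)
    return min(overlaps)


table = Box(lo=(0.0, 0.0, 0.0), hi=(1.0, 1.0, 0.75))
cup_resting = Box(lo=(0.4, 0.4, 0.75), hi=(0.5, 0.5, 0.85))   # sits on the tabletop
cup_clipping = Box(lo=(0.4, 0.4, 0.70), hi=(0.5, 0.5, 0.80))  # sunk 5 cm into it
```

Here `penetration_depth(table, cup_resting)` is zero, while the clipping cup reports an overlap of about 0.05 m. A generated scene that fails a check like this may look fine on screen but cannot be handed to a physics engine as is.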

04 Toward a Unified World Model

Despite the immense difficulties, Fei-Fei Li offers a positive prediction of industry trends: The boundaries between rendering, simulation, and planning are becoming increasingly blurred.

This is not a distant vision but a reality already unfolding. After exploration, Fei-Fei Li's World Labs team believes humanity is already moving towards a unified foundation model. In this architecture, imagination and logic can merge into one.

The models of the future will no longer be a patchwork of single-function add-ons, but a unified neural network foundation. It can simultaneously render realistic scenes via Gaussian splatting and generate the collision meshes required by physics engines in real time. Simply put, a unified foundation model will achieve seamless switching between the visual patterns humans need and the state patterns physics engines require.

From another perspective, traditional models are static, while future world models will possess stronger interactivity. Renderers will no longer be passive video generators but will gradually begin to accept action instructions; simulators will become more editable and controllable; planners will also be capable of logical reasoning, automatically adjusting strategies based on environmental changes.

05 The Long Arc of Spatial Intelligence

Finally, returning to the macro level, why is all this about "world models" important?

In Fei-Fei Li's view, decades of AI research have been searching for that key to allow machines to enter the physical world. Today, we already possess language models adept at handling logic; what we need next are models that handle space. The core of spatial intelligence lies in how machines interact with the physical world they inhabit.

This battle is not about who possesses more computing power, but about who can define the digital standard for the physical world.

World models are by no means a simple algorithmic optimization, but a grand feat of AI evolution.

"Language gives machines the ability to talk about this world, while world models are the way machines ultimately understand, imagine, reason, and interact with the physical world."

Every person in this era is transitioning from the stage of talking about the world toward a new epoch of truly understanding and reconstructing it.

Nonetheless, world models are merely an intermediate node on the path to AGI, and the AI created by humans still has a long way to go before reaching a truly meaningful "world model." Here, the somewhat extreme view of another world model luminary, Yann LeCun, is worth sharing:

Optimistically, it will take at least another five to ten years for machine intelligence to barely approach that of a puppy.

This article is from the WeChat public account "Silicon-Based Spark," author: Siqi
