"The world is everything that is the case."
In 1921, Ludwig Wittgenstein wrote this famous sentence in *Tractatus Logico-Philosophicus*. A century later, it is quoted by AI pioneer Fei-Fei Li as the opening of her latest technical blog post.
In deep learning, people have grown accustomed over the past three years to AI's disruptive impact on language, beginning with ChatGPT, which gave machines abilities in expression, programming, and reasoning that rival, and on some tasks surpass, those of humans.
However, behind this digital miracle lies a blind spot that is often overlooked: machines can talk about the world, yet remain ignorant of its physical essence. The blog post released by Fei-Fei Li serves as a sobering reality check.
Today, as generative AI has become an indispensable tool worldwide, the industry's own definition of "world models" is growing increasingly chaotic. Whether in video generation or embodied intelligence, companies are vying for the authority to define the concept.
After Fei-Fei Li published the post, many believed she was trying to reclaim the definition of "world models." On the contrary, I think what she truly aims to do is issue a declaration: the world is constituted not by language but by the rigorous laws of physical space and time.
For machines to truly step into the human physical world, they must break free from the comfort zone of text statistics and instead understand the refraction of light, the inertia of objects, and the logic of collisions. This is not only a paradigm shift in technology but also a necessary path for AI's advancement toward embodied intelligence.
01
We Need a Taxonomy
It must be admitted that in the AI lexicon, "world model" has devolved into a catch-all term; any project involving image generation or environment simulation can seemingly be linked to it. This ambiguity stems from the many different things people mean when they try to define the "world."
When a technology is just starting out, there naturally won't be unified doctrines to confine it within clear boundaries. This chaos in defining "world models" is not uncommon in history. When ancient Greek philosophers debated whether the essence of the world was water, fire, or indivisible atoms, they were essentially searching for a cornerstone for their reasoning.
The AI field now faces a similar problem: When a video generation model produces visuals that are extremely realistic yet physically impossible, how should we define it? Fei-Fei Li's blog mentions an ancient and robust foundational definition: the Partially Observable Markov Decision Process (POMDP).
This is also the core formalism of reinforcement learning, and it captures the closed loop of interaction between an agent and the physical world: the agent takes an Action, which changes the world's State. However, the agent lacks a god's-eye view and can only build a partial picture of reality through Observation.
Essentially, a world model is the abstract model of the world that a machine builds in its "brain" to survive within this closed loop. If any part of this loop is not clearly defined, then the so-called world model remains merely a blind stacking of pixels.
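To make the Action, State, Observation loop concrete, here is a minimal sketch in Python. It is not taken from Fei-Fei Li's post: the `CorridorWorld` class, its noisy wall sensor, and the always-move-right policy are hypothetical stand-ins chosen only to show that the agent never reads the true state directly.

```python
import random
from dataclasses import dataclass


@dataclass
class CorridorWorld:
    """Hypothetical 1-D corridor; the true state is the agent's cell index."""
    length: int = 5
    position: int = 0  # hidden State, never handed to the agent directly

    def step(self, action: int) -> None:
        """Apply an Action (-1 = left, +1 = right) and update the State."""
        self.position = max(0, min(self.length - 1, self.position + action))

    def observe(self, rng: random.Random, noise: float) -> bool:
        """Partial Observation: a noisy 'am I at the right wall?' sensor."""
        at_wall = self.position == self.length - 1
        # With probability `noise` the sensor reports the wrong answer.
        return at_wall if rng.random() >= noise else not at_wall


def run_episode(seed: int = 0, steps: int = 10, noise: float = 0.2):
    """Run the closed loop and return the (action, observation) history."""
    rng = random.Random(seed)
    world = CorridorWorld()
    history = []
    for _ in range(steps):
        action = 1                        # trivial policy: always move right
        world.step(action)                # the Action changes the State
        obs = world.observe(rng, noise)   # the agent only gets an Observation
        history.append((action, obs))
    return history
```

Everything the agent could learn about the corridor has to be inferred from `history`; a world model, in this framing, is whatever internal estimate of `position` the agent builds from it.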
02
The Three Pillars of Building Intelligence
This loop sounds simple, with each component's function easily understood. However, upon careful analysis, each contains countless details with blurred definitions. To explain the chaos within, Fei-Fei Li deconstructs world models into three core components. They serve both as a technical taxonomy and as the three pillars for AI's journey toward embodied intelligence.
1. Renderer
The core logic of the renderer is visual plausibility. Its output is pixels, striving to make the imagery appear natural, coherent, and aesthetically pleasing to the human eye.
This is currently the most mature field commercially. Models we are familiar with, such as OpenAI's Sora and ByteDance's Seedance 2.0 for video generation, and OpenAI's GPT-image-2 and Google's Nano Banana 2 for image generation, are essentially the most sophisticated visual probability machines available. By learning from billions of internet images and videos, they have ultimately mastered the distribution patterns of light, shadow, and form.
This seemingly beautiful reality comes at a cost, as Fei-Fei Li points out. These top models can generate magnificent architecture, but place one of those buildings in an interactive, physics-governed environment and it would likely collapse at once for lack of any supporting structure. In other words, they don't understand what "support" is; they generate only what the viewer "sees," not what the world "is."
2. Simulator
What the simulator pursues is precisely the structural fidelity that the renderer lacks. It doesn't care at all whether a video looks good; its sole concern is whether the world follows physical laws. When a simulator outputs a mundane cup, it must include the cup's mass distribution, material friction coefficient, gravity response, and physical boundaries during collisions.
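As a rough illustration of what "structural fidelity" means for the cup, the sketch below attaches a few physical attributes to an object and answers one question a renderer cannot: does the cup slide on a tilted surface? The `Cup` class, its field names, and the numbers are hypothetical; the only physics used is the standard static-friction condition that an object on an incline stays put while tan(θ) ≤ μ.

```python
import math
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass(frozen=True)
class Cup:
    """Hypothetical physical description of a cup (not pixels)."""
    mass_kg: float
    friction_coeff: float  # static friction against the supporting surface
    radius_m: float
    height_m: float


def weight_newtons(cup: Cup) -> float:
    """Gravity response: the force the cup exerts on a level table."""
    return cup.mass_kg * G


def stays_put_on_incline(cup: Cup, incline_deg: float) -> bool:
    """Static friction holds the cup while tan(theta) <= mu."""
    return math.tan(math.radians(incline_deg)) <= cup.friction_coeff


cup = Cup(mass_kg=0.3, friction_coeff=0.5, radius_m=0.04, height_m=0.10)
```

With these assumed values, `stays_put_on_incline(cup, 20)` is true and `stays_put_on_incline(cup, 30)` is false, since tan(30°) is about 0.577 and exceeds the friction coefficient of 0.5. A real simulator tracks far more (tipping, contact geometry, deformation), but the point is the same: the answer comes from physical parameters, not from appearance.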
With a simulator, the content in videos gains a claim to realism. However, simulators are not only severely underestimated but often outright ignored in the current AI wave.
As the cup example shows, a simulator turns "discussing art" into "studying physics." Building a simulator that strictly adheres to physical laws demands enormous computational resources and annotation costs. But for robots, visual aesthetics are almost a useless attribute; physical precision determines everything.
If a simulator is not accurate enough, robots trained in it may never make it into the real world. The Sim-to-Real gap is real: actions that pass 100% of tests in the lab can be completely derailed by minute friction in the real world. The difficulty echoes what is often called Moravec's paradox: the sensorimotor skills that seem effortless to humans are among the hardest for machines.
3. Planner
The planner is responsible for action output. As the link between perception and feedback, it must answer a core question that has no standard answer: "What should be done next?" In Fei-Fei Li's framework, it is also the final component of the entire "perception-action" closed loop and, at the same time, the most challenging frontier.
All current Vision-Language-Action (VLA) models are attempting to enable systems to make decisions in unstructured, complex worlds. The planner doesn't merely predict the future; it chooses, from countless possibilities, the path most likely to achieve the goal. It is the key for machines to evolve from "observers" into "practitioners."
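The idea of choosing, among many possible futures, the path most likely to reach the goal can be sketched with a brute-force planner over a toy grid. This is an illustrative assumption, not how VLA models work internally: `predict` stands in for a learned world model, and `plan` simply imagines every action sequence up to a fixed horizon and keeps the one that ends closest to the goal.

```python
from itertools import product

# Hypothetical action set for a 2-D grid world.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def predict(state, action_name, walls):
    """Toy world model: next cell, or stay in place if a wall blocks the move."""
    dx, dy = ACTIONS[action_name]
    nxt = (state[0] + dx, state[1] + dy)
    return state if nxt in walls else nxt


def rollout(start, candidate, walls):
    """Imagine the outcome of an action sequence without acting in the world."""
    state = start
    for name in candidate:
        state = predict(state, name, walls)
    return state


def plan(start, goal, walls, horizon=4):
    """Enumerate all action sequences and keep the one ending nearest the goal."""
    best_plan, best_cost = None, float("inf")
    for candidate in product(ACTIONS, repeat=horizon):
        end = rollout(start, candidate, walls)
        cost = abs(end[0] - goal[0]) + abs(end[1] - goal[1])  # Manhattan distance
        if cost < best_cost:
            best_plan, best_cost = list(candidate), cost
    return best_plan, best_cost
```

Exhaustive search only works here because the toy world has four actions and a short horizon; practical planners have to sample or learn which futures are worth imagining.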
03
The Hundred-Billion-Dollar Hub
Among the three categories Fei-Fei Li outlines, models corresponding to the renderer and the planner are relatively common; the simulator, by contrast, is the hardest component to realize. Fei-Fei Li also offers an insightful judgment: the simulator is the link connecting rendering and planning, and the core hub of the entire system.
The company doing the best work in simulators is not OpenAI, Anthropic, or Google, but Jensen Huang's NVIDIA.
NVIDIA's Omniverse claims to support trillion-dollar digital twin dreams precisely because it grasps the essence of the simulator. On NVIDIA's platform, the operations of factories, supply chains, and warehouses have all become complete digital mirrors. For the industrial world, this is no longer a visual demo but a core infrastructure for productivity.
This is not an exaggeration but a trillion-dollar market opportunity visible to all.
From virtual visualization in architectural engineering to molecular dynamics simulations in the pharmaceutical industry and scenario testing for autonomous driving, what these industries lack is not vivid image or video generation models but a high-fidelity simulator. It is fair to say that mastering the ability to simulate the physical world amounts to holding a priority ticket for AI industrialization.
But the difficulties in reality leave this field with almost no technological optimists. Fei-Fei Li also admits that a huge gap persists.
First is the issue of embodied intelligence data, which we have repeatedly mentioned before. Video data on the internet is abundant, but 3D data with explicit geometric structure, material properties, and physical feedback annotations is extremely scarce.
Second, the application of generative AI will always carry hidden risks. AI-generated geometric models can at best achieve visual perfection but are often physically unreasonable, like cups intersecting with tabletops, or objects colliding and losing volume. In everyday terms, the short phrase "clipping through" sums up these bizarre phenomena, but in real industrial applications it spells disaster.
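A simple way to see why "clipping through" is a checkable physical error rather than a matter of taste is an interpenetration test on bounding boxes. The sketch below is an assumption-laden simplification: real engines use collision meshes, while the hypothetical `Box` here is just an axis-aligned bounding box, and `penetration_depth` reports how far two boxes overlap along their shallowest axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box, coordinates in metres."""
    min_corner: tuple[float, float, float]
    max_corner: tuple[float, float, float]


def penetration_depth(a: Box, b: Box) -> float:
    """Return the smallest per-axis overlap, or 0.0 if the boxes do not intersect."""
    overlaps = [
        min(a.max_corner[i], b.max_corner[i]) - max(a.min_corner[i], b.min_corner[i])
        for i in range(3)
    ]
    # Boxes interpenetrate only if they overlap on all three axes;
    # merely touching (zero overlap) counts as valid contact.
    return min(overlaps) if all(o > 0 for o in overlaps) else 0.0


table = Box((0.0, 0.0, 0.0), (1.0, 1.0, 0.75))
resting_cup = Box((0.4, 0.4, 0.75), (0.5, 0.5, 0.85))   # sits on the tabletop
clipping_cup = Box((0.4, 0.4, 0.70), (0.5, 0.5, 0.80))  # sunk 5 cm into the table
```

Here `penetration_depth(table, resting_cup)` is zero, while `penetration_depth(table, clipping_cup)` is about 0.05 m. Both cups could look identical in a rendered frame taken from above, which is exactly the gap between visual and physical correctness described here.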
04
Toward a Unified World Model
Despite the immense difficulties, Fei-Fei Li offers a positive prediction of industry trends: The boundaries between rendering, simulation, and planning are becoming increasingly blurred.
This is not a distant vision but a reality already unfolding. Based on its own explorations, Fei-Fei Li's World Labs team believes the field is already moving toward a unified foundation model. In this architecture, imagination and logic can merge into one.
The models of the future will no longer be a patchwork of single-function add-ons, but a unified neural network foundation. It can simultaneously render realistic scenes via Gaussian splatting and generate the collision meshes required by physics engines in real time. Simply put, a unified foundation model will achieve seamless switching between the visual patterns humans need and the state patterns physics engines require.
From another perspective, traditional models are static, while future world models will possess stronger interactivity. Renderers will no longer be passive video generators but will gradually begin to accept action instructions; simulators will become more editable and controllable; planners will also be capable of logical reasoning, automatically adjusting strategies based on environmental changes.
05
The Long Arc of Spatial Intelligence
Finally, returning to the macro level, why is all this about "world models" important?
In Fei-Fei Li's view, decades of AI research have been searching for that key to allow machines to enter the physical world. Today, we already possess language models adept at handling logic; what we need next are models that handle space. The core of spatial intelligence lies in how machines interact with the physical world they inhabit.
This battle is not about who possesses more computing power, but about who can define the digital standard for the physical world.
World models are by no means a simple algorithmic optimization; they are a major step in AI's evolution.
"Language gives machines the ability to talk about this world, while world models are the way machines ultimately understand, imagine, reason, and interact with the physical world."
Every person in this era is transitioning from the stage of talking about the world toward a new epoch of truly understanding and reconstructing it.
Nonetheless, world models are merely an intermediate node on the path to AGI, and the AI created by humans still has a long way to go before reaching a truly meaningful "world model." Here, the somewhat extreme view of another world model luminary, Yann LeCun, is worth sharing:
Optimistically, it will take at least another five to ten years for machine intelligence to barely approach that of a puppy.
This article is from the WeChat public account "Silicon-Based Spark," author: Siqi








