The world model is currently one of the hottest concepts in AI circles, and one of the most confusing for ordinary people. Some say it's the ability for AI to dream, others call it a simulator for autonomous driving, and still others describe it as the brain of a robot.
Fei-Fei Li, Yann LeCun, OpenAI, Google DeepMind, and NVIDIA, as well as China's domestic giants like Alibaba, Tencent, Huawei, and the automakers, each have their own definitions.
This article attempts to explain in plain language:
What problem world models aim to solve; why these scholars and big tech companies are fascinated by them; and why this concept has become an industrial battleground even before its name has been standardized.
I. Understanding in One Sentence: Letting AI Pre-enact the World in a 'Mental Sandbox'
Imagine you're standing at an intersection about to cross the street.
Your eyes see the green light, vehicles, pedestrians; your brain constructs a miniature scenario within milliseconds: if I walk now, will that car accelerate? Will that cyclist suddenly turn?
You haven't actually stepped out; you've first run through several possibilities in your mind.
Psychologists call this ability a 'mental model,' while AI researchers term it a 'world model.'
In other words, a world model is a 'mental sandbox' inside a machine.
It doesn't simply recognize what's in a scene; it can predict what will happen next and run trial and error repeatedly without taking real action.
For autonomous driving, it can generate virtual test scenarios for heavy rain, blizzards, and irregular obstacles; for robots, it can let humanoid robots fall 100,000 times in a simulated world before going outside; for gaming and film companies, it could be an infinitely explorable parallel universe.
By 2026, the term 'world model' was appearing in tech reports far more often than it was being clearly defined.
Alibaba developed Qwen-AgentWorld, HappyOyster, Qwen-RobotWorld, targeting language worlds, virtual worlds, and physical worlds respectively; Tencent's HY-World 2.0 emphasizes 3D editable worlds; Nio, Xpeng, Li Auto prefer terms like 'driving world model' or 'world behavior model'; Huawei and Baidu seldom use the term alone in public materials.
The confusion in naming makes the concept seem like a catch-all basket.
But behind all the terms lies a common core:
Allowing the machine to first establish an internally deducible, reviewable environment before taking real action. This environment can be pixels, 3D structures, physical parameters, or abstract states. The goal is to reduce unlimited reliance on real data, compressing the real world into a data engine capable of infinite generation, infinite mistakes, and infinite retries.
The lack of unified naming precisely indicates that world models are in the early stage of transitioning from an academic concept to industrial infrastructure.
II. The Source of Thought: A WWII Psychologist and Several AI Pioneers
2.1 Kenneth Craik: The First to Talk About a 'Small Model in the Mind'
The idea of world models predates deep learning by most of a century. In 1943, Scottish psychologist Kenneth Craik, in his book 'The Nature of Explanation,' proposed that the human brain constructs 'small-scale models' of reality to predict and understand external events.
Craik was still in his late twenties then, a scholar at the Cambridge University Psychological Laboratory who was also engaged in applied psychology research in Britain during WWII.
His book was published two years before he died in a bicycle accident at the age of 31.
But the idea persisted: humans don't need to fully replicate the world; a sufficiently useful internal model allows pre-enactment before action.
This view aligns almost perfectly with the core of today's AI world models. Machines also don't need to remember every detail of the world but learn the laws governing it and deduce the future when needed.
After Craik, in the 1980s, British psychologist Philip Johnson-Laird further systematized this thought, arguing that much human reasoning involves manipulating 'mental models' in the brain. He worked for many years at Cambridge and Princeton and is a key figure in cognitive science.
2.2 Marvin Minsky: The One Who Wanted Machines to Have a Common-Sense Framework
The field of artificial intelligence echoed this early on. In 1974, Marvin Minsky at MIT proposed 'frame theory.'
He was a co-founder of the MIT AI Lab, a 1969 Turing Award laureate, and often regarded as one of the founders of the AI discipline.
Frame theory attempted to capture human commonsense knowledge about the world using structured knowledge frames:
Entering a door requires finding the handle first; restaurants typically have tables and chairs; objects fall under gravity.
What Minsky aimed to do is exactly what world models today still haven't accomplished—giving machines a structured, deducible common-sense knowledge base of the world.
2.3 David Ha & Jürgen Schmidhuber: Bringing World Models Back to the Deep Learning Mainstream
The field of reinforcement learning approached the same goal from another path.
In 2018, David Ha and Jürgen Schmidhuber's NeurIPS paper, 'Recurrent World Models Facilitate Policy Evolution,' reintroduced the term 'world model' to the deep learning mainstream.
David Ha was at Google Brain then and later co-founded Sakana AI. His work style leans towards engineering, and he is skilled at creating impressive demos with concise architectures.
Jürgen Schmidhuber is a longtime scientific director of the Swiss AI Lab IDSIA and one of the inventors of Long Short-Term Memory (LSTM) networks; he is known in the AI field for being outspoken and holding independent views. He is sometimes called the 'father of modern AI'; the title is debated, but his academic influence is undeniable.
Their architecture was simple:
Use a VAE to compress high-dimensional frames into low-dimensional latent vectors, use an RNN to learn the changes of these vectors over time, then use a simple controller to train policies in 'imagination.'
The agent first dreams in the learned world model, then transfers the policy back to the real environment.
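The pipeline above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the paper's implementation: the VAE is replaced by a simple averaging encoder, the RNN by a one-parameter linear dynamics model fitted with least squares, and the controller by a single gain chosen through a grid search. All function names (encode, fit_dynamics, imagined_cost, train_in_dream) and the toy 1-D environment are hypothetical.

```python
import random

# Toy "real" environment: a point on a line; a frame is 8 noisy copies of its position.
def real_step(x, action):
    return x + 0.5 * action              # true dynamics, unknown to the agent

def render(x, rng):
    return [x + rng.gauss(0, 0.01) for _ in range(8)]   # high-dimensional "frame"

# V: compress a frame into a 1-D latent (stand-in for the VAE encoder).
def encode(frame):
    return sum(frame) / len(frame)

# M: learn latent dynamics z' = z + k * a from logged transitions (stand-in for the RNN).
def fit_dynamics(transitions):
    num = sum(a * (z_next - z) for z, a, z_next in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den                     # least-squares estimate of k

# C: a one-parameter controller a = -gain * z, scored purely in imagination.
def imagined_cost(gain, k, z0=1.0, horizon=10):
    z, cost = z0, 0.0
    for _ in range(horizon):
        z = z + k * (-gain * z)          # roll the learned model forward; no real steps
        cost += z * z                    # penalize distance from the origin
    return cost

def train_in_dream(k, candidates):
    return min(candidates, key=lambda g: imagined_cost(g, k))

rng = random.Random(0)

# 1) Collect random experience in the real environment.
logged, x = [], 0.0
for _ in range(200):
    a = rng.uniform(-1, 1)
    x_next = real_step(x, a)
    logged.append((encode(render(x, rng)), a, encode(render(x_next, rng))))
    x = x_next

# 2) Fit the world model, then 3) train the controller inside it.
k_hat = fit_dynamics(logged)
best_gain = train_in_dream(k_hat, [0.5 * i for i in range(1, 8)])

# 4) Transfer the policy back to the real environment.
x = 1.0
for _ in range(10):
    x = real_step(x, -best_gain * encode(render(x, rng)))

print("learned k:", round(k_hat, 3), "gain:", best_gain, "final |x|:", round(abs(x), 4))
```

The point of the sketch is the ordering: real data is used only to fit the model, and the controller never touches the real environment until step 4.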
This paper was selected for a NeurIPS oral presentation, directly inspiring the later Dreamer series and turning 'world model' from a psychological concept into an engineering goal in deep learning.
III. World Models in the Eyes of Scholars
3.1 Yann LeCun: Don't Just Generate Videos, Understand Physics
Yann LeCun is French, a professor at New York University, and served for years as Chief AI Scientist at Meta.
He is one of the inventors of Convolutional Neural Networks (CNN) and was jointly awarded the 2018 Turing Award with Geoffrey Hinton and Yoshua Bengio; the trio is hailed as the 'Godfathers of Deep Learning.'
LeCun has consistently been critical of the current large language model path, believing that merely predicting the next word cannot produce true intelligence.
In 2022, in an article titled 'A Path Towards Autonomous Machine Intelligence,' he proposed that true intelligence requires a configurable predictive world model.
The goal is not generating text or images but understanding the laws of the physical world and predicting action consequences. He even criticized continuing to scale up large language models as 'nonsense,' arguing that the core of intelligence lies in learning the physical structure of the real world.
JEPA is the technical vehicle for this path. JEPA stands for Joint Embedding Predictive Architecture.
Rather than predicting the next frame in pixel space, JEPA predicts how the world state changes in an abstract representation space.
An analogy: video generation models are drawing the next picture; JEPA is 'feeling' what will happen next in the mind.
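The difference between the two objectives can be made concrete with a toy example. The sketch below is an assumption-laden illustration rather than JEPA itself: the encoder is hand-written instead of learned, there is no target encoder or anti-collapse mechanism, and every name (make_frame, encode, predict_latent, decode_to_pixels) is hypothetical. It only shows why comparing embeddings lets a model ignore details it cannot predict.

```python
import random

WIDTH = 16

def make_frame(ball_pos, rng):
    # One bright "ball" pixel on top of unpredictable background texture.
    frame = [rng.uniform(0.0, 0.3) for _ in range(WIDTH)]
    frame[ball_pos] = 1.0
    return frame

def encode(frame):
    # Hypothetical encoder: keeps only the ball position and discards texture.
    # In JEPA this mapping is learned; here it is hand-written for illustration.
    return float(frame.index(max(frame)))

def predict_latent(z, velocity=1.0):
    # Predictor that operates purely on the abstract state.
    return z + velocity

def decode_to_pixels(z):
    # Extra work a pixel-space model must do: paint every pixel of the next frame.
    frame = [0.15] * WIDTH               # best constant guess for the random texture
    frame[int(z)] = 1.0
    return frame

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(1)
current = make_frame(ball_pos=4, rng=rng)
target = make_frame(ball_pos=5, rng=rng)   # ball moved one pixel; texture re-sampled

z_pred = predict_latent(encode(current))
latent_loss = (z_pred - encode(target)) ** 2          # JEPA-style: compare embeddings
pixel_loss = mse(decode_to_pixels(z_pred), target)    # generative-style: compare pixels

print(f"latent loss = {latent_loss:.4f}, pixel loss = {pixel_loss:.4f}")
```

Even with a perfect prediction of the ball's motion, the pixel loss stays above zero because the texture is random; the latent loss does not pay for that noise.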
The 2023 I-JEPA, 2024 V-JEPA, 2025 LeJEPA, and 2026 LeWorldModel form a continuously evolving system.
LeCun also borrowed the 'System 1 / System 2' distinction from cognitive psychology: System 1 is intuitive, fast reaction; System 2 invokes the world model for deliberate reasoning and planning.
Latest theoretical work even proves that under certain conditions, the representations learned by JEPA can establish a linear correspondence with real physical variables, meaning the model mathematically learns physical structure, not just a useful encoding.
3.2 Fei-Fei Li: Classifying World Models Using an 'Action-Observation' Loop
Fei-Fei Li is a professor of computer science at Stanford University, the primary creator of the ImageNet dataset. ImageNet catalyzed the deep learning revolution in 2012, earning her the title 'Godmother of AI.'
She previously served as Chief Scientist of AI at Google Cloud and founded World Labs in 2023 to focus on spatial intelligence and 3D world models. In 2024, she received multiple honors for promoting AI democratization and applications in fields such as healthcare, and she is among the most influential scientists of Chinese descent in AI today.
In June 2026, Fei-Fei Li and the World Labs team published a widely circulated article attempting to establish a taxonomy for the chaotic world model concept.
She referenced POMDP (Partially Observable Markov Decision Process) from reinforcement learning.
This concept sounds complex but describes a simple cycle: the agent takes an action, the action changes the world state, the agent obtains an observation, then takes the next action based on the observation.
She pointed out that all systems called world models are essentially projections of this cycle in different directions, each outputting a fragment of the cycle.
Based on this, she classified world models into three categories.
The first is Renderers, outputting observations—pixels for the human eye. Typical examples are video generation models and Google Genie 3, optimizing for visual fidelity.
The second is Simulators, outputting states—faithful world representations at geometric, physical, and dynamic levels. Typical examples are NVIDIA Omniverse and World Labs' Marble, optimizing for structural accuracy.
The third is Planners, outputting actions: answering 'what to do next' given observations and goals. Typical examples are vision-language-action (VLA) models and World Action Models.
Li believes these three capabilities rely on the same underlying knowledge, and the ultimate trend is towards a unified world model.
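The action-observation cycle and its three projections can be written out as a tiny loop. This is a hedged sketch on a made-up 1-D grid world: State, transition, observe, and plan are hypothetical names chosen for illustration, and the mapping of each function to 'Simulator', 'Renderer', and 'Planner' is one reading of the taxonomy described above, not code from World Labs.

```python
from dataclasses import dataclass

@dataclass
class State:
    # Underlying world state: the thing a "Simulator" is asked to output.
    position: int
    goal: int

def transition(state, action):
    # World dynamics: the action changes the state.
    return State(position=state.position + action, goal=state.goal)

def observe(state, width=10):
    # "Renderer" direction: state -> observation (a tiny 1-D image).
    cells = ["."] * width
    cells[state.goal] = "G"
    cells[state.position] = "A"      # the agent is drawn over the goal once it arrives
    return "".join(cells)

def plan(observation):
    # "Planner" direction: observation -> next action.
    if "G" not in observation:
        return 0                     # agent is standing on the goal
    agent, goal = observation.index("A"), observation.index("G")
    return 1 if goal > agent else -1

state = State(position=2, goal=7)
trace = []
for _ in range(10):
    obs = observe(state)             # the agent obtains an observation
    action = plan(obs)               # and picks the next action from it
    trace.append((obs, action))
    if action == 0:
        break
    state = transition(state, action)   # the action changes the world state

for obs, action in trace:
    print(obs, action)
```

Each of the three functions covers one arc of the same loop, which is the sense in which the categories are projections of a single cycle.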
3.3 Tsinghua FIB-Lab: Only Two Types of World Models—Understanding the World or Predicting the Future
Tsinghua University's FIB-Lab is a team with a long record of research on AGI, embodied intelligence, and robot learning. FIB stands for Future Intelligence laB, and the group sits in Tsinghua's Department of Electronic Engineering.
The team has published numerous surveys and papers on world models and robotics and is a significant force in domestic research in this direction.
In a survey first posted to arXiv in late 2024, 'Understanding World or Predicting Future? A Comprehensive Survey of World Models,' they divided the field in another way.
They classified the core functions of world models into two broad categories: Understanding the World and Predicting the Future.
Understanding the World emphasizes constructing implicit representations of the external environment to support decision-making, represented by the Dreamer series and world knowledge based on large language models.
Predicting the Future emphasizes explicitly generating future states, typified by video or 3D environment generation models like Sora, Genie 3, Cosmos.
This classification's advantage is being closer to engineering practice: the former serves reinforcement learning and decision-making, the latter serves generation and simulation.
3.4 Peking University OpenWorldLib: Making a Standardized Toolbox for World Models
In April 2026, Peking University jointly with institutions like Kuaishou released OpenWorldLib. Peking University is a domestic powerhouse in AI foundational research, housing institutions like the Key Laboratory of Machine Perception and Intelligence (MoE); Kuaishou is a domestic short-video giant, investing heavily in large models and multimodal generation in recent years.
Their joint release of OpenWorldLib shows both academia and industry are realizing world models need unified standards and reusable components.
OpenWorldLib first attempted a standardized definition for world models: a model or framework with perception as its core, possessing interactive and long-term memory capabilities, used for understanding and predicting the complex world.
They criticized equating world models simply with 'predicting the next frame' as too narrow, believing true world models must embody genuine understanding of physical laws.
OpenWorldLib splits world models into five core modules: Operator, Synthesis, Reasoning, Representation, Memory, coordinated by a pipeline module.
This framework resembles a toolbox, aiming to let different research teams combine modules like building blocks.
IV. World Models in the Eyes of Big Tech
4.1 OpenAI: Sora as a 'World Simulator'
OpenAI is currently one of the most influential AI companies globally. It is famous for the GPT series of large language models and ChatGPT. After releasing Sora in 2024, it again sparked global attention on video generation and world simulation.
In February 2024, OpenAI released Sora's technical report titled 'Video Generation Models as World Simulators,' directly positioning video generation models as world simulators. Sora doesn't rely on explicit 3D modeling or physics engines but trains generative models on massive video data, enabling emergent abilities like 3D consistency, long-term coherence, object permanence, and simple world interactions.
OpenAI believes large-scale scaling of video generation models is a promising path to building a general simulator of the physical world.
But Sora's limitations are evident: inability to accurately simulate basic physics like glass breaking, inconsistencies in long samples, objects appearing uncontrollably. So it's more a directional statement than a mature definition.
4.2 Google DeepMind: Genie 3 as a Real-Time, Interactive General World Model
Google acquired the UK AI company DeepMind in 2014 and merged it with Google Brain in 2023 to form Google DeepMind; Demis Hassabis is the co-founder and CEO.
DeepMind developed milestone systems like AlphaGo and AlphaFold, one of the global frontiers in AI research. Demis Hassabis himself is a computer scientist, neuroscientist, and game designer, long focused on AGI.
In August 2025, Google DeepMind released Genie 3, which it described as a general-purpose world model and its first to support real-time interaction.
It can generate explorable 3D environments from simple text descriptions, runs at 20-24 fps, supports character control, promptable world events, and interactive memory up to one minute. Genie 3 generates frames autoregressively, anchors the real world using Google Maps street view data, and is positioned as a key milestone towards AGI.
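The figures above imply a concrete memory budget: at 24 fps, one minute of interactive memory is 24 x 60 = 1,440 frames (1,200 at 20 fps). The sketch below shows what autoregressive generation with a bounded frame history could look like. It is an assumption for illustration only: Genie 3's actual memory mechanism is not described in this article, and rollout, toy_next, and the sliding-window design are hypothetical.

```python
from collections import deque

def rollout(first_frame, actions, next_frame, fps=24, memory_seconds=60):
    # Keep only the most recent fps * memory_seconds frames as conditioning context.
    context = deque([first_frame], maxlen=fps * memory_seconds)
    frames = [first_frame]
    for action in actions:
        # Autoregressive step: the next frame depends on the remembered frames
        # plus the user's latest action.
        frame = next_frame(list(context), action)
        context.append(frame)        # the oldest frame is dropped once the window is full
        frames.append(frame)
    return frames, context

# Toy stand-in for the generative model: a "frame" is just the camera position.
def toy_next(history, action):
    return history[-1] + action

frames, context = rollout(0, [1] * 2000, toy_next)
print(len(frames), len(context), context[0], context[-1])
```

In this toy run, anything older than the 1,440-frame window is no longer available to the model, which is one way a fixed memory horizon would limit long-range consistency.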
4.3 NVIDIA: Cosmos as the 'World Foundation Model' for Physical AI
NVIDIA was founded in 1993 by Jensen Huang, Chris Malachowsky, and Curtis Priem, with Jensen Huang long serving as CEO. The company started with graphics chips (GPUs) and became the core supplier of global AI infrastructure over the past decade due to exploding demand for AI training compute.
Jensen Huang frequently proposes judgments like 'Physical AI' and 'The next wave of AI is robotics.' NVIDIA also continuously launches software/hardware platforms for robotics, autonomous driving, and simulation.
In January 2025, NVIDIA released Cosmos, positioned as a 'World Foundation Model Platform.' It's not a single model but a series of physics-aware video models that can predict and generate future states of virtual environments, divided into Nano, Super, Ultra tiers, trained on 20 million hours of real-world data.
Cosmos's ambition is to become the underlying infrastructure for Physical AI, serving robotics, autonomous driving, industrial simulation, etc.
NVIDIA also open-sourced it, allowing commercial use.
4.4 Domestic Giants: Not Calling It World Models, But Doing World Models
Domestic enterprises rarely provide philosophical definitions in public materials; they go straight to products and scenarios.
Alibaba's three products cover language world simulation, virtual world generation, and robot physical world respectively;
Tencent's HY-World 2.0 focuses on 3D editable worlds; ByteDance's Seed world model aims to reach Genie 3's SOTA level by year-end;
Huawei's Pangu Model Intelligent Driving Edition emphasizes physical law learning and closed-loop simulation; Baidu Apollo ADFM integrates world model capabilities into the autonomous driving large model; Xiaomi's OneVL attempts to unify VLA with world models.
Among automakers, Nio's NWM, Li Auto's reconstruction-plus-generation world model, Xpeng's X-World, Geely's WAM, BYD's pre-research work, and Great Wall's VLA-plus-world-model approach share the same core uses: end-to-end intelligent driving training and long-tail scenario generation.
V. Three Technical Paths: Drawing, Mental Calculation, Building Blocks
From an engineering perspective, current world models roughly have three main technical paths, understandable through three metaphors.
The first is the 'Drawing' path, i.e., generative video models. Sora, Genie 3, Cosmos, Kuaishou's Kling, and Pika belong here. The core ability is generating future frames in pixel space. The advantages are strong visual realism, a low data threshold, and results that are easy to understand. The disadvantage is weak physical consistency: watch long enough and you see object distortion, gravity failure, and timeline confusion.
The second is the 'Mental Calculation' path, represented by LeCun's JEPA and Ha & Schmidhuber's RNN world model. The core idea is to predict abstract representations rather than pixels. The advantages are high efficiency and more stable learning of physical structure; the disadvantages are poor interpretability of the representation space and long engineering implementation cycles. It's more like an athlete's intuition: there is no need to mentally replay the action frame by frame to anticipate where the ball will land.
The third is the 'Building Blocks' path, represented by NVIDIA Omniverse, World Labs Marble, and Tencent HY-World. The core idea is directly generating 3D environments with geometric, physical, and dynamic properties. The advantage is that the result is precise, controllable, editable, and verifiable; the disadvantages are scarce data, high computational cost, and limited generalization. It's more like an engineer's CAD software: precisely measurable and repeatedly adjustable, but distant from the natural world.
The three paths currently have their own territories, but boundaries are blurring. Video generation models are adding physical constraints; 3D simulators are introducing generative capabilities; JEPA architectures are merging with VLA into WAM. The unified world model predicted by Fei-Fei Li is precisely the result of their fusion.
VI. World Action Model: From 'Seeing the World' to 'Taking Action'
In May 2026, the Fudan OpenMOSS team jointly with multiple institutions released a WAM survey, formally proposing the World Action Models paradigm.
Fudan OpenMOSS is one of the earliest teams promoting the large model open-source ecosystem domestically; the MOSS series models have high recognition in the Chinese community.
WAM's core definition: Future state prediction and action generation must be jointly learned within the same policy, not training a VLA first then attaching a world model as an auxiliary.
A plain-language comparison: VLA is 'see the scene, understand the instruction, then take action'; world model is 'know the current state and action, can imagine the next frame'; WAM is 'see the scene, understand the instruction, simultaneously imagine the next frame and take action.'
These three combined are the true 'unity of knowledge and action' ability robots need.
WAM is divided into Cascaded and Joint architectures.
The Cascaded architecture generates future frames first and then decodes actions from them; it is easier to build from an engineering standpoint, but latency is higher and errors propagate easily. The Joint architecture uses a single model to output the future and the action simultaneously; it is theoretically more robust, but its training objective is harder to design.
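The two architectures can be contrasted with a toy sketch. Everything here is hypothetical and chosen for illustration: the function names, the 1-D world in which an object drifts by +1 per step, and the weighted-sum form of joint_loss are assumptions, not definitions taken from the WAM survey.

```python
def cascaded_policy(obs, target, world_model, action_decoder):
    # Stage 1 imagines the future; stage 2 decodes an action from that imagination.
    future = world_model(obs)
    return future, action_decoder(future, target)

def joint_policy(obs, target, backbone, future_head, action_head):
    # One shared representation feeds both heads in a single forward pass.
    h = backbone(obs, target)
    return future_head(h), action_head(h)

def joint_loss(pred_future, true_future, pred_action, true_action, weight=1.0):
    # Joint objective: prediction error and action error are optimized together.
    return (pred_future - true_future) ** 2 + weight * (pred_action - true_action) ** 2

# Toy 1-D world: the object drifts +1 per step; the right action closes the gap to target.
def true_model(obs):
    return obs + 1

def biased_model(obs):
    return obs + 1 + 0.5             # stage-1 prediction error of 0.5

def decoder(future, target):
    return target - future

# Cascaded: an error in the imagined future reappears unchanged in the action.
_, good_action = cascaded_policy(0.0, 3.0, true_model, decoder)
_, bad_action = cascaded_policy(0.0, 3.0, biased_model, decoder)
propagated_error = abs(bad_action - good_action)

# Joint: both outputs come from the same features and are scored by one loss.
future, action = joint_policy(
    0.0, 3.0,
    backbone=lambda obs, target: (obs, target),
    future_head=lambda h: h[0] + 1,
    action_head=lambda h: h[1] - h[0] - 1,
)
loss = joint_loss(future, 1.0, action, 2.0)

print(propagated_error, future, action, loss)
```

The cascaded half shows the error-propagation concern directly; the joint half shows only the structure, since the claimed robustness would have to come from training, which this sketch does not do.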
NVIDIA's Jim Fan even asserted at the 2026 Sequoia AI Ascent conference, 'VLA is dead, world action models are the future.' Jim Fan is a senior research scientist at NVIDIA, head of the GEAR team, researching robotics, simulation, embodied intelligence.
Though controversial, this statement shows how hot the field has become.
VII. Industry Framework: A Three-Tier Structure Has Formed
The world model industry chain is transitioning from papers and demos to layered infrastructure. Imagine building a house: some mine and smelt steel, some produce prefabricated panels, some build residences, malls, factories on top.
The upstream is the Basic Support Layer, including high-precision data collection, computing services, and sensor hardware.
Data collection involves HD maps, spatial scanning, video capture, and teleoperation; computing services center on GPUs and cloud servers; sensor hardware includes LiDAR, cameras, and IMUs. With its GPUs, NVIDIA is the invisible hegemon here; almost all world model training relies on its computing power.
Cost is the core pain point: training trillion-parameter world models requires thousands of GPUs, single training costs can reach millions of dollars.
The midstream is the Technology Platform Layer, divided into general-purpose platforms and vertical platforms.
General-purpose platforms provide cross-industry, general-purpose capabilities, represented by NVIDIA Omniverse, SenseTime OpenDIL, Huawei Pangu, and the Alibaba Tongyi series. Vertical platforms focus on specific industries, such as autonomous driving world models, architectural world models, and embodied intelligence world models. Platform companies are gaining dominance through ecosystem integration and are projected to hold over 50% of the industry chain's market share by 2030.
The downstream is the Scenario Application Layer, covering autonomous driving, embodied intelligence, smart construction, gaming/entertainment, spatial services, medical simulation, climate prediction, etc.
Automotive, electronics, and healthcare are believed to contribute over 60% of current industry revenue. Autonomous driving is the most mature application scenario: almost all mainstream automakers have incorporated world models into their core R&D processes. Embodied intelligence is the most promising emerging direction, with over 60% of industrial robots reportedly using world models for assisted training.
VIII. Why Lack of Conceptual Unity is Actually Good
The chaos surrounding the world model concept often makes outsiders think it's a hyped-up trend.
But from an industrial history perspective, lack of conceptual unity is often the norm in the early stages of a technological revolution.
Early cloud computing had IaaS, PaaS, SaaS debates; early big data had Hadoop, NoSQL, data warehouse debates; early AI even had symbolism, connectionism, behaviorism debates. Naming disagreements reflect different groups approaching the same grand problem from different angles.
The current disagreement over world models is essentially a debate over what form the 'world' should be compressed into.
Video generation folks see the world as pixel sequences; 3D engine folks see it as geometry and physics; autonomous driving folks see it as traffic rules and driving behaviors; robotics folks see it as action consequences.
Each compression method corresponds to different data, compute, and application scenarios. In the industry's early stage, such disagreement is necessary, allowing parallel exploration of different paths.
But beneath the disagreement, goals have converged.
Whether it's LeCun's JEPA, Fei-Fei Li's POMDP loop, Sora's video generation, Genie 3's 3D interaction, or various domestic giants' products, all ultimately point to the same capability: endowing machines with an internal world that is deducible, reviewable, and generalizable, enabling them to act more safely, more efficiently, and more generally in the real world.
Language models gave machines the ability to talk about the world; world models attempt to give them the ability to understand, imagine, reason, and interact with the world.
The concept will unify, but that will happen after the landscape settles. Until then, the chaos in naming is precisely the sign that world models have entered the main battlefield.
This article is from the WeChat public account 'IT桔子' (ID: itjuzi521), author: Judy








