A War Without a Unified Name: The World Model Landscape of China's Tech Giants

Published by marsbit on 2026-06-25, updated 2026-06-25

Article Summary

The article outlines the diverse and fragmented landscape of "World Models" in China's tech industry, where major players pursue similar goals under different names, such as world foundation model or physical AI, or fold the capability into autonomous driving and embodied intelligence systems. The core aim is to let AI build an internal, dynamic environment for simulation, reasoning, and learning, reducing reliance on endless real-world data. This "data engine" allows unlimited generation, experimentation, and iteration. The report categorizes the approaches of different companies:

* **Internet Giants:** Alibaba is developing models for linguistic, virtual, and physical worlds (Qwen-AgentWorld, HappyOyster, Qwen-RobotWorld). Tencent's HY-World focuses on 3D, game, and social scenarios. ByteDance leverages its vast video data for a potential "digital twin" model. Huawei integrates its model into industrial applications like smart cars and robotics without separately branding it. Baidu embeds world model capabilities within its Apollo autonomous driving and Ernie systems.
* **Automakers:** Companies like NIO, Li Auto, XPeng, and Geely are using world models as virtual "driving schools" and "testing grounds." They generate complex scenarios (e.g., rain, snow) to train and validate autonomous driving systems in simulation, aiming for more capable and safer AI drivers.
* **Autonomous Driving Suppliers:** Firms such as Momenta, Horizon Robotics, Haomo.ai, and DeepRo...

The industry still has no unified name for the "world model." Some call it a World Model, some a World Foundation Model, some Physical AI, and others bury it inside the architecture of autonomous driving large models, VLAs, or embodied intelligence systems without giving it a name of its own.

Alibaba's Qwen-AgentWorld, HappyOyster, and Qwen-RobotWorld target language worlds, virtual worlds, and physical worlds respectively; Tencent's HY-World leans toward editable 3D worlds; automakers prefer terms like Driving World Model or World Behavior Model; and Huawei and Baidu simply avoid saying "world model" out loud.

Behind the naming chaos, everyone is essentially doing the same thing:

Before a machine takes real action, let it first build an internal, dynamic environment that it can reason through and replay. This reduces its endless dependence on real-world data by compressing the real world into a data engine that allows infinite generation, infinite mistakes, and infinite do-overs.

While startups are still grappling with data access and computing budgets, Alibaba, Tencent, Huawei, NIO, Xpeng, and Li Auto have quietly opened up a new track with world models.

The world model is an ambition: to have AI do more than recognize the world, and instead run through it in its mind first.

Autonomous driving companies want to use it to generate "exam papers" featuring rain, snow, and irregular obstacles; embodied intelligence teams want robots to fall 100,000 times in simulation before they ever go outside; gaming and social companies want to build parallel universes that people can immerse themselves in.

Major tech companies have entered the arena with different focuses, but the core goal is consistent: to compress the real world into a data engine capable of infinite reasoning and review.

I. Internet Giants:

From Digital to Physical Worlds

Alibaba's world model lineup most resembles stocking a shelf one item at a time.

In June 2026, it played three cards in quick succession:

The Qwen-Robot series on June 16th, HappyOyster 1.0 on June 17th, and Qwen-AgentWorld on June 24th.

Qwen-AgentWorld is a native language world model: it generates environments rather than images. Across seven environments (MCP tools, search, terminal, code engineering, Web, operating systems, and Android), the model can simulate real interactions, learn on its own, and refine itself through reinforcement learning. It comes in two sizes, both MoE architectures: 35B total parameters with 3B activated, and 397B total with 17B activated. Its training data comes from more than 10 million real-world interaction trajectories, and both the model and its evaluation benchmark, AgentWorldBench, are open-source. In effect, this makes the world model an agent's "training ground" rather than a "decoration."
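The two sizes are easier to compare as ratios. In a Mixture-of-Experts model only the routed experts run for each token, so per-token compute tracks activated rather than total parameters. The sketch below simply reuses the quoted figures; treating compute as proportional to activated parameters is a simplifying assumption that ignores attention and routing overhead.

```python
# Quoted Qwen-AgentWorld figures: (total, activated) parameters in billions.
SIZES = {
    "small": (35, 3),
    "large": (397, 17),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of all weights that take part in one token's forward pass."""
    return active_b / total_b


for name, (total_b, active_b) in SIZES.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} of weights active per token")

# Small -> large: growth in stored weights vs. (roughly) per-token compute.
print(f"weights x{397 / 35:.1f}, activated params x{17 / 3:.1f}")
```

Under that simplification, the larger model stores about 11.3 times as many weights as the smaller one but needs only about 5.7 times the per-token compute, and each activates well under a tenth of its weights (about 8.6% and 4.3%).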

HappyOyster 1.0 wears a different face, closer to a "playable movie set": from a single sentence or image it generates an open world, which users can then steer freely in two modes, "World Exploration" and "Real-time Director." Exploration mode supports continuous real-time movement and camera control for up to 1 minute; Director mode can generate real-time 480p/720p footage lasting more than 3 minutes. Alibaba positions it as an entry point for interactive games, virtual companionship, interactive short videos, and cultural tourism experiences.

Qwen-RobotWorld heads in yet another direction. It is the "thinking brain" of Alibaba's embodied intelligence trio, working alongside the VLA manipulation model Qwen-RobotManip and the VLN navigation model Qwen-RobotNav, with the aim of giving robots an inner world in which they can preview their actions.

Taken together, the three show Alibaba competing at once for the power to define language worlds, virtual worlds, and physical worlds.

Tencent Hunyuan takes another path: its HY-World series is more like an "automated factory for 3D games."

In July 2025, Tencent released and open-sourced Hunyuan 3D World Model 1.0 at WAIC, upgraded it to 1.5 in December, and released and open-sourced HY-World 2.0 in April 2026. Inputs can be text, a single image, multiple images, video, or even untextured "white" models; outputs can be 3D Gaussian Splatting (3DGS), meshes, or point clouds.

Version 2.0 introduced modules like HY-Pano 2.0, WorldNav, WorldStereo 2.0, and WorldMirror 2.0, linking world generation, world reconstruction, panorama images, and real-time world generation into a closed loop.

Tencent's advantage lies in gaming and social scenarios: HY-World's real users are not training autonomous driving systems but building game levels, virtual film shoots, and digital twins.

ByteDance's world model project looks more like a "secret march," carrying the genes of its short-video data.

In August 2025, The Information reported that ByteDance's Seed team, led by Zhou Chang (formerly a core member of the Tongyi Qianwen team), was developing a world model. Its biggest trump cards are the more than 1 billion videos flowing through Douyin and TikTok every day and the EX-4D framework, which can turn monocular videos into 4D multi-view scenes. The project benchmarks itself against Google's Genie 3 and Meta's V-JEPA 2, aiming not for a pretty video generator but for a "digital twin" that simulates physical laws.

At the Volcengine FORCE Original Power Conference on June 23, 2026, ByteDance did not directly release this world model but showcased the Doubao Seed 2.1 series, Seedance 2.5 video generation model, Seedream 5.0 Pro image generation model, and a new audio generation model.

Meanwhile, an exclusive 36Kr report summarized ByteDance's 2026 AI strategy as four propositions: the world model should reach global SOTA by year-end, Seedance explores dynamic generation, coding consolidates the foundation, and Doubao accelerates commercialization.

In other words, the world model is ByteDance's top internal priority, yet the company is letting Seedance and Doubao take the front stage while it keeps building toward a bigger move behind the scenes.

Huawei's Pangu World Model has an aura of being "low-key but lethal."

At its developer conference in June 2025, Huawei released the Pangu World Model. Built on the Pangu multimodal large model, its core capability is generating high-precision digital-physical spaces from a single image. It can predict collisions, train robotic arms to grasp, and generate driving videos and LiDAR point clouds, helping Huawei's ADS end-to-end model ship "one version every two days."

Huawei does not shout the "World Model" slogan; it treats the model as the "training foundation" for smart cars and embodied intelligence. Its collaboration with GAC is a typical case: 2D video and 3D point clouds are aligned pixel by pixel, so complex corner cases can be reconstructed in minutes.
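Pixel-level alignment of 2D video with 3D point clouds usually rests on projecting each LiDAR point into the camera image. The sketch below shows the standard pinhole projection for points already expressed in the camera frame. It is not Huawei's pipeline: the intrinsics (`fx`, `fy`, `cx`, `cy`) and the 1920×1080 camera are hypothetical, and a real system would also apply a LiDAR-to-camera extrinsic transform and time synchronization, both omitted here.

```python
from typing import List, Optional, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[int, int]


def project_point(p: Point3, fx: float, fy: float, cx: float, cy: float,
                  width: int, height: int) -> Optional[Pixel]:
    """Pinhole projection of a camera-frame point (x right, y down, z forward).

    Returns None if the point is behind the camera or lands outside the image.
    """
    x, y, z = p
    if z <= 0.0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    if not (0.0 <= u < width and 0.0 <= v < height):
        return None
    # u and v are non-negative here, so int() truncation equals flooring to a pixel index.
    return int(u), int(v)


def project_cloud(points: List[Point3], fx: float, fy: float, cx: float, cy: float,
                  width: int, height: int) -> List[Optional[Pixel]]:
    """Project every point; result[i] is the pixel (or None) for points[i]."""
    return [project_point(p, fx, fy, cx, cy, width, height) for p in points]


# Hypothetical 1920x1080 camera with a 1000-pixel focal length.
cloud = [(0.0, 0.0, 10.0), (1.0, -0.5, 5.0), (0.0, 0.0, -3.0)]
print(project_cloud(cloud, 1000.0, 1000.0, 960.0, 540.0, 1920, 1080))
# [(960, 540), (1160, 440), None]
```

Once each point has a pixel index, per-pixel attributes such as depth can be attached to the matching video frame, which is what makes the two modalities usable together.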

At HDC 2026 in June 2026, Huawei upgraded the Pangu large model to 7.0 and released the Ascend 910C, with Yu Chengdong returning to lead Pangu, but there was no separate news about a new version of the world model itself.

This approach of "the world model does not exist separately but serves the industrial closed-loop" is Huawei's consistent style.

Baidu entered the autonomous driving field earlier: Apollo ADFM, released in May 2024, was positioned as "the world's first autonomous driving large model supporting L4-level unmanned driving."

Although Baidu does not call it a world model, it essentially has world model capabilities: it understands the physical world and predicts the behavior of traffic participants through end-to-end neural networks. In November 2025, ERNIE 5.0 debuted as a natively omni-modal model with 2.4 trillion parameters; the official version launched in January 2026.

Baidu's world model capabilities are already hidden inside a larger chess game. Its strategy is not to talk about the world model on its own, but to let Apollo and ERNIE complement each other.

Xiaomi and SenseTime represent two types of "technologists."

Xiaomi OneVL, open-sourced by Xiaomi on May 13, 2026, unifies VLA, world models, and latent-space reasoning in a single framework. It emphasizes an interpretable visual reasoning process and serves as a foundational component for both autonomous driving and embodied intelligence.

Kaiwu, from SenseTime's Jueying, is more like a "veteran driver" already on duty. A September 2025 Frost & Sullivan report defined it as the industry's first mass-produced, interactive world model. It can generate 150-second, 1080P driving videos from 11 viewpoints, and SenseTime has accumulated WorldSim-Drive, the industry's largest generative driving dataset, along with a generative scene library numbering in the tens of millions.

In June 2026, DaXiao Robotics, founded by SenseTime co-founder Wang Xiaogang, announced a financing round of hundreds of millions of dollars. Its Kairos World Model 3.0 topped four major generative prediction leaderboards, covering dimensions such as embodied video generation and task instruction following.

The SenseTime-affiliated world model is spreading from smart cars to robots.

II. Automakers:

Using World Models as Driving Schools and Test Centers

If internet giants' world models are about "creating worlds," then automakers' world models are about "using worlds."

NIO is the Chinese automaker that first waved the world model banner.

At NIO IN in July 2024, Ren Shaoqing released the NWM (NIO World Model), positioned as China's first smart driving world model.

It uses a multivariate autoregressive generative architecture to do two things: "imaginative reconstruction" in space and "imaginative reasoning" in time.

Given a real scene, it can reconstruct the 3D world; given a three-second prompt, it can generate over two minutes of future video. Every 0.1 seconds, it reasons through 216 trajectories and selects the optimal one.
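The "generate many candidate trajectories, score them, keep the best" cycle can be sketched as follows. Only the 0.1 s cycle and the count of 216 come from the text; the 12 × 18 grid of accelerations and yaw rates, the 3-second kinematic rollout, and the cost weights are hypothetical and say nothing about how NWM actually generates or ranks trajectories.

```python
import math
from itertools import product
from typing import List, Tuple

DT = 0.1            # planning cycle in seconds, matching the 0.1 s figure above
HORIZON_STEPS = 30  # hypothetical: each candidate is rolled 3 s into the future
# Hypothetical candidate grid: 12 accelerations x 18 yaw rates = 216 trajectories.
ACCELS = [-3.0 + 0.5 * i for i in range(12)]       # m/s^2, -3.0 ... 2.5
YAW_RATES = [(i - 8.5) * 0.02 for i in range(18)]  # rad/s, about -0.17 ... 0.17

Path = List[Tuple[float, float]]


def rollout(v0: float, accel: float, yaw_rate: float) -> Path:
    """Constant-control kinematic rollout from the origin, heading along +x."""
    x = y = heading = 0.0
    v = v0
    path: Path = []
    for _ in range(HORIZON_STEPS):
        v = max(0.0, v + accel * DT)
        heading += yaw_rate * DT
        x += v * math.cos(heading) * DT
        y += v * math.sin(heading) * DT
        path.append((x, y))
    return path


def cost(path: Path, accel: float, obstacle: Tuple[float, float],
         safety_radius: float = 2.0) -> float:
    """Hypothetical cost: collision penalty + lane offset + discomfort - progress."""
    ox, oy = obstacle
    min_gap = min(math.hypot(px - ox, py - oy) for px, py in path)
    collision = 1e6 if min_gap < safety_radius else 0.0
    lane_offset = sum(py * py for _, py in path) / len(path)
    progress = path[-1][0]
    return collision + lane_offset + 0.5 * accel * accel - 0.2 * progress


def plan_one_cycle(v0: float, obstacle: Tuple[float, float]):
    """Score every candidate once; return (cost, accel, yaw_rate, path) of the best."""
    best = None
    for accel, yaw_rate in product(ACCELS, YAW_RATES):
        path = rollout(v0, accel, yaw_rate)
        c = cost(path, accel, obstacle)
        if best is None or c < best[0]:
            best = (c, accel, yaw_rate, path)
    return best


best_cost, accel, yaw_rate, path = plan_one_cycle(v0=10.0, obstacle=(25.0, 0.0))
print(f"chose accel={accel:+.1f} m/s^2, yaw rate={yaw_rate:+.2f} rad/s")
```

At one cycle every 0.1 s, this amounts to 2,160 candidate evaluations per second. Because the collision penalty dwarfs the other terms, any collision-free candidate outranks every colliding one, and the remaining terms trade lane keeping and comfort against progress.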

NIO's logic is clear: end-to-end models are not enough, and a truly smart driving system needs to "imagine road conditions with its eyes closed," as humans do.

On June 18, 2026, NIO officially pushed NWM 2.0 to more than 700,000 users across all its series. Even owners who bought their cars four years ago can upgrade for free, with simultaneous releases for the four major vehicle platforms, including Banyan, Cedar, and Coconut+. The new version is the first in China to have the driving model directly output raw steering-wheel and accelerator/brake-pedal signals, and it upgrades the training system from "World Model + Closed-loop Reinforcement Learning" to a three-layer "World Model + Supervised Fine-tuning + Closed-loop Reinforcement Learning" system. Its AEB covers 6.7 times as many scenarios as standard AEB, and the probability of false braking has dropped to once per 100,000 kilometers.

NIO's Shenji NX9031 chip is even described as "born for world models."

Li Auto proposed a "reconstruction + generation" world model approach in the second half of 2024 and published DrivingSphere at CVPR 2025.

It combines the OccDreamer diffusion model with VideoDreamer, built on a spatio-temporal DiT (ST-DiT), to construct a high-fidelity 4D closed-loop simulation environment.

Traditional open-loop simulation can only evaluate what the model "sees," whereas closed-loop simulation evaluates what the model "does." Li Auto's world model works like an exam center that can generate tricky questions without end, letting the driving system master challenging scenarios in simulation before it meets them on the road.
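The difference between open-loop and closed-loop evaluation is easiest to see in a toy example. In the hypothetical 1-D car-following setup below (not DrivingSphere), the open-loop score only compares a policy's action with the logged action at each recorded state, while the closed-loop score lets the policy's own actions drive a simple simulator. The expert controller, the constant-bias "learned" policies, and all gains are assumptions for illustration.

```python
from typing import Callable, List, Tuple

DT = 0.1      # simulation step in seconds (hypothetical)
STEPS = 300   # 30 s of simulated car following

State = Tuple[float, float]        # (gap to lead car in m, lead speed minus ego speed in m/s)
Policy = Callable[[State], float]  # maps a state to ego acceleration in m/s^2


def expert(s: State) -> float:
    """Logged driver behaviour (hypothetical): PD control toward a 20 m gap."""
    gap, rel_v = s
    return 0.5 * (gap - 20.0) + 1.0 * rel_v


def make_biased(bias: float) -> Policy:
    """A 'learned' policy that imitates the expert but adds a constant acceleration bias."""
    return lambda s: expert(s) + bias


def step(s: State, accel: float) -> State:
    """Advance the toy simulator; the lead car keeps a constant speed."""
    gap, rel_v = s
    rel_v -= accel * DT  # ego speeding up shrinks (lead - ego) speed
    gap += rel_v * DT
    return gap, rel_v


def rollout(policy: Policy, s0: State) -> List[State]:
    states = [s0]
    for _ in range(STEPS):
        states.append(step(states[-1], policy(states[-1])))
    return states


def open_loop_error(policy: Policy, logged: List[State]) -> float:
    """Mean action error on the expert's logged states: the policy never drives."""
    return sum(abs(policy(s) - expert(s)) for s in logged) / len(logged)


def closed_loop_final_gap(policy: Policy, s0: State) -> float:
    """Gap at the end of an episode driven entirely by the policy's own actions."""
    return rollout(policy, s0)[-1][0]


s0 = (30.0, 0.0)
logged = rollout(expert, s0)
for bias in (0.3, -0.3):
    policy = make_biased(bias)
    print(f"bias {bias:+.1f}: open-loop error {open_loop_error(policy, logged):.2f}, "
          f"closed-loop final gap {closed_loop_final_gap(policy, s0):.2f} m")
```

Both biased policies show the same 0.30 m/s² open-loop error, yet in closed loop one settles about 0.6 m closer to the lead car (19.40 m) and the other about 0.6 m farther back (20.60 m). That difference only becomes visible once the policy's own actions feed back into the simulation.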

At Livis Day in June 2026, Li Auto went further and upgraded this capability into "Mach VLA," a native multimodal MoE architecture that unifies perception, prediction, and planning, running on dual M100 chips with 2,560 TOPS of combined computing power and a reaction time of 0.28 seconds.
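Two of the quoted figures can be made concrete with simple arithmetic. The article does not define how the 0.28-second reaction time is measured; the sketch assumes it is the delay before the system starts to respond while the car keeps its speed, splits the 2,560 TOPS evenly across the two chips, and uses 60 and 120 km/h as hypothetical example speeds.

```python
TOTAL_TOPS = 2560   # quoted figure for the dual-M100 setup
CHIPS = 2
REACTION_S = 0.28   # quoted reaction time


def per_chip_tops(total: float = TOTAL_TOPS, chips: int = CHIPS) -> float:
    """Average per-chip compute, assuming the total is split evenly (an assumption)."""
    return total / chips


def reaction_distance_m(speed_kmh: float, reaction_s: float = REACTION_S) -> float:
    """Distance travelled at constant speed before the system begins to respond."""
    return speed_kmh / 3.6 * reaction_s


print(f"{per_chip_tops():.0f} TOPS per chip")
for kmh in (60, 120):  # hypothetical example speeds
    print(f"{kmh} km/h: {reaction_distance_m(kmh):.2f} m covered in {REACTION_S} s")
```

At 120 km/h, 0.28 s corresponds to about 9.3 m of travel; at 60 km/h, about 4.7 m.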

According to Li Auto's roadmap, the new Mach VLA will be pushed to AD Max users in Q3, with the Q4 target aligned with Tesla FSD V14. Li Auto is no longer just a car company; it is shaping itself into a provider of the embodied intelligence system Livis.

Xpeng's path follows a clear progression: "first make it big, then make it precise."

In April 2025, Xpeng first disclosed at an AI technology sharing event in Hong Kong that it was developing an ultra-large-scale autonomous driving "World Foundation Model" with 72 billion parameters.

A year later, on April 1, 2026, Xpeng officially released the X-World World Model technical report.

Built on video diffusion, it adapts the latent-space video generation paradigm of WAN 2.2, using a 3D causal VAE and a DiT with view-temporal self-attention to generate consistent footage across 7 surround-view cameras.

X-World is not a video generation tool but the "real-world simulator" for Xpeng's second-generation VLA: simulation scenarios have grown from 30,000 a year ago to more than 500,000, daily simulated test mileage is equivalent to 30 million kilometers of real-vehicle testing, and it supports online reinforcement learning and overseas data generation.

At CVPR in June 2026, Xpeng presented its complete world model technology roadmap for the first time. Its ambition is written into the application scope: AI cars, AI robots, and flying cars. It targets a training data scale of 200 million clips, a 10,000-card cluster delivering 10 EFLOPS of computing power, and an iteration every 5 days.
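The Xpeng figures can likewise be put on a per-card and per-year basis. The article does not say at which numeric precision the 10 EFLOPS is measured, so the per-card value is only an average under an even split, and the scenario growth treats the "more than 500,000" figure as a lower bound.

```python
EXA, PETA = 1e18, 1e15

CLUSTER_FLOPS = 10 * EXA   # quoted: 10 EFLOPS
CARDS = 10_000             # quoted: 10,000-card cluster
ITERATION_DAYS = 5         # quoted: one iteration every 5 days
SCENARIOS_BEFORE, SCENARIOS_NOW = 30_000, 500_000  # quoted simulation scenario counts


def per_card_pflops() -> float:
    """Average throughput per card, assuming the cluster figure divides evenly."""
    return CLUSTER_FLOPS / CARDS / PETA


iterations_per_year = 365 // ITERATION_DAYS
scenario_growth = SCENARIOS_NOW / SCENARIOS_BEFORE

print(f"~{per_card_pflops():.1f} PFLOPS per card")
print(f"~{iterations_per_year} model iterations per year")
print(f"simulation scenarios grew ~{scenario_growth:.1f}x in a year")
```

That works out to about 1 PFLOPS per card on average, roughly 73 iterations a year at a 5-day cadence, and at least about 16.7 times more simulation scenarios than a year earlier.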

Geely Auto showcased WAM (World Action Model) at CES 2026, integrating it into its All-domain AI 2.0 system.

WAM's layered architecture is interesting: the upper layer is MLLM (Multimodal Large Language Model) responsible for understanding, the lower layer is Action Expert responsible for actions, and the middle is the world model responsible for reasoning.
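WAM's three-layer split can be sketched as a simple understand, imagine, act pipeline. Everything below is a hypothetical toy: the class names `UnderstandingLayer`, `WorldModelLayer`, and `ActionExpert`, the intent strings, the 2-second look-ahead rules, and the 10 m safety threshold illustrate the data flow described above and are not Geely's interfaces or logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MIN_SAFE_GAP_M = 10.0  # hypothetical safety threshold


@dataclass
class Scene:
    """Hypothetical, heavily simplified input to the stack."""
    speed_mps: float
    gap_to_lead_m: float
    instruction: str  # e.g. a spoken request from the cabin


class UnderstandingLayer:
    """Stand-in for the MLLM: turns scene and instruction into candidate intents."""

    def propose(self, scene: Scene) -> List[str]:
        intents = ["keep_lane", "slow_down"]
        if "overtake" in scene.instruction.lower():
            intents.append("change_lane_left")
        return intents


class WorldModelLayer:
    """Stand-in for the world model: imagines a 2-second outcome for each intent."""

    def imagine(self, scene: Scene, intent: str) -> Tuple[float, float]:
        """Return (predicted gap in m, progress in m) under toy assumptions."""
        v, gap = scene.speed_mps, scene.gap_to_lead_m
        if intent == "slow_down":
            return gap + 2.0, (v - 1.0) * 2.0
        if intent == "change_lane_left":
            return 100.0, (v + 1.0) * 2.0  # assumes the left lane is free
        return gap, v * 2.0  # keep_lane: lead car assumed to hold our speed


class ActionExpert:
    """Stand-in for the action expert: maps the chosen intent to low-level controls."""

    CONTROLS: Dict[str, Tuple[float, float]] = {  # (accel m/s^2, steering rad)
        "keep_lane": (0.0, 0.0),
        "slow_down": (-1.0, 0.0),
        "change_lane_left": (0.5, 0.05),
    }

    def act(self, intent: str) -> Tuple[float, float]:
        return self.CONTROLS[intent]


def decide(scene: Scene) -> Tuple[str, Tuple[float, float]]:
    """Understand, imagine, act: keep the safe intent with the most progress."""
    intents = UnderstandingLayer().propose(scene)
    outcomes = {i: WorldModelLayer().imagine(scene, i) for i in intents}
    safe = [i for i in intents if outcomes[i][0] >= MIN_SAFE_GAP_M]
    if safe:
        chosen = max(safe, key=lambda i: outcomes[i][1])
    else:  # nothing looks safe: fall back to whatever keeps the largest gap
        chosen = max(intents, key=lambda i: outcomes[i][0])
    return chosen, ActionExpert().act(chosen)


print(decide(Scene(speed_mps=20.0, gap_to_lead_m=30.0, instruction="Please overtake")))
```

With "Please overtake" and a 30 m gap, the toy picks the lane change; with a 5 m gap and no instruction, nothing clears the threshold and it falls back to slowing down. The point of the split is that the world model layer evaluates consequences before the action layer commits to controls.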

Geely's goal is not merely to improve the driving model but to turn the entire vehicle into "one brain," with unified scheduling of smart driving, cockpit, chassis, and powertrain. In April 2026, the Zeekr 8X launched with immediate delivery, becoming the first mass-produced vehicle in China equipped with a super AI agent that fuses cockpit and driving; its G-ASD 4.0 is based on WAM. The 2026 target is highway L3 and low-speed L4.

BYD's world model is still in early R&D. Information disclosed in January 2025 showed that its internal team was following Tesla's path, forming a small team for rapid trial and error focused on generating corner-case data for end-to-end driving.

Great Wall Motor has also named VLA + world model as its next-generation driving direction and is moving it from "strategy" to "mass production." At its Smart Driving & Overseas Expansion Conference in June 2026, Great Wall shared its VLA practices. Its Baoding Jiuzhou Supercomputing Center has reached 5 EFLOPS with more than 10,000 GPUs. The Tank 700 will be the first model equipped with the Coffee Pilot 4.0 VLA system, slated for mass production within 2026. More than 2 million vehicles in its existing fleet generate massive amounts of data every day, which is Great Wall's most substantial asset compared with the newer EV makers.

III. Smart Driving Suppliers:

The World Engine Hidden Under the Car

Outside of automakers, a group of suppliers have turned world models into "invisible engines."

Momenta officially released the R7 Reinforcement Learning World Model at the Beijing Auto Show in April 2026 and launched it straight into mass production.

It is a three-layer architecture: World Model Pre-training, World Model Simulation, and Reinforcement Learning. R7 is based on over 12 billion kilometers of real vehicle mileage brought by Momenta's mass production business, extracting over 100 million segments of "golden data" for pre-training, then letting the model experience massive long-tail scenarios in simulation, and finally polishing it with reinforcement learning.
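The pre-train, simulate, then reinforce sequence described for R7 can be illustrated with a deliberately tiny toy in which the "model" is a single gain `k` in a car-following controller. All of it is hypothetical: the stage functions (`stage1_pretrain`, `stage2_scenarios`, `stage3_polish`), the synthetic "golden" data, the scenario mix, and the reward. Plain hill climbing stands in for reinforcement learning. It shows only how the three stages hand off to one another, not Momenta's method.

```python
import random
from typing import List, Tuple

DT, STEPS, TARGET_GAP = 0.1, 150, 20.0


def episode_reward(k: float, gap0: float) -> float:
    """Toy closed-loop episode; policy is accel = k*(gap - 20) + rel_v. Higher is better."""
    gap, rel_v, cost = gap0, 0.0, 0.0
    for _ in range(STEPS):
        accel = k * (gap - TARGET_GAP) + rel_v
        rel_v -= accel * DT
        gap += rel_v * DT
        cost += (gap - TARGET_GAP) ** 2 * DT
        if gap < 2.0:  # near-contact counts as a heavily penalized failure
            return -1e6
    return -cost


def stage1_pretrain(golden: List[Tuple[float, float]]) -> float:
    """Behaviour cloning: least-squares fit of the gap-correcting term, accel ~ k * gap_error."""
    return sum(e * a for e, a in golden) / sum(e * e for e, _ in golden)


def stage2_scenarios(rng: random.Random, n: int = 20) -> List[float]:
    """Simulated initial gaps, deliberately oversampling rare short gaps (the 'long tail')."""
    return [rng.uniform(4.0, 8.0) if i % 4 == 0 else rng.uniform(15.0, 40.0) for i in range(n)]


def mean_reward(k: float, scenarios: List[float]) -> float:
    return sum(episode_reward(k, g) for g in scenarios) / len(scenarios)


def stage3_polish(k: float, scenarios: List[float], iters: int = 50, step: float = 0.05) -> float:
    """Hill climbing as a stand-in for RL: keep a perturbed gain only if simulated reward improves."""
    best = mean_reward(k, scenarios)
    for _ in range(iters):
        for cand in (k - step, k + step):
            r = mean_reward(cand, scenarios)
            if r > best:
                k, best = cand, r
    return k


rng = random.Random(0)
errors = [rng.uniform(-15.0, 15.0) for _ in range(200)]
golden = [(e, 0.5 * e + rng.gauss(0.0, 0.2)) for e in errors]  # hypothetical logged pairs
k_pre = stage1_pretrain(golden)
scenarios = stage2_scenarios(rng)
k_rl = stage3_polish(k_pre, scenarios)
print(f"pre-trained k={k_pre:.3f}, reward={mean_reward(k_pre, scenarios):.1f}")
print(f"polished    k={k_rl:.3f}, reward={mean_reward(k_rl, scenarios):.1f}")
```

Because `stage3_polish` only accepts a gain when the simulated reward improves, the polished policy can never score worse than the pre-trained one on the same scenario set.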

Momenta embeds it directly into its end-to-end foundation model, aiming to meet L4 standards. Its commercial footprint is also expanding rapidly: more than 900,000 mass-produced vehicles carry Momenta's systems, more than 100 production models have been delivered, cumulative design wins exceed 210 models, and its solutions are deployed in more than 10 countries and regions, including the UK, Norway, Singapore, Australia, and New Zealand.

In June 2026, Momenta passed its Hong Kong Stock Exchange listing hearing and is sprinting toward an IPO as the "first Physical AI stock," backed by a 65% share of the third-party urban NOA market. The listing shows how heavily it is betting on world models.

Horizon Robotics released HorizonDrive in May 2026, an autoregressive world model with the core capability of minute-level long-sequence driving video generation.

It operates in latent space using a video VAE, taking high-definition maps, 3D bounding boxes, and ego-vehicle actions as inputs and outputting continuous future scenes.
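Autoregressive generation in latent space means each predicted latent frame becomes the input for the next step, conditioned on per-step signals such as maps, boxes, and ego actions. The toy below (a hypothetical 2-D latent, linear dynamics, and a scalar `cond` standing in for map and box conditioning) shows why long sequences are hard: a model with a 1% one-step error drifts further from the reference at every step. It does not implement HorizonDrive or its SRR/TRD self-correction, whose details the article does not give.

```python
from typing import Callable, List, Sequence

Latent = List[float]
StepFn = Callable[[Latent, float, float], Latent]


def rollout(z0: Latent, actions: Sequence[float], conds: Sequence[float],
            step: StepFn) -> List[Latent]:
    """Autoregressive generation: each predicted latent is fed back as the next input."""
    z, frames = list(z0), []
    for action, cond in zip(actions, conds):
        z = step(z, action, cond)
        frames.append(z)
    return frames


def make_linear_step(decay: float) -> StepFn:
    """Toy 2-D latent dynamics; `cond` stands in for map/box conditioning (hypothetical)."""
    def step(z: Latent, action: float, cond: float) -> Latent:
        return [decay * z[0] + 0.1 * z[1] + cond, z[1] + 0.1 * action]
    return step


def drift(reference: List[Latent], generated: List[Latent]) -> List[float]:
    """Per-step absolute error on the first latent channel."""
    return [abs(r[0] - g[0]) for r, g in zip(reference, generated)]


horizon = 50
actions, conds = [0.0] * horizon, [0.0] * horizon
truth = rollout([1.0, 0.0], actions, conds, make_linear_step(1.00))
model = rollout([1.0, 0.0], actions, conds, make_linear_step(1.01))  # 1% one-step error
errs = drift(truth, model)
print(f"error after 1 step: {errs[0]:.3f}, after {horizon} steps: {errs[-1]:.3f}")
```

Here the error is about 0.01 after one step but grows to roughly 0.64 after 50 steps, because each step builds on the previous step's mistake.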

HorizonDrive's highlight is "self-correction" training: through its SRR and TRD techniques, the model learns to correct its own generation errors. On nuScenes, its FID drops by 52% and FVD by 37%, and trajectory accuracy improves by 21%. A single RTX 5090 can generate 256×512 video at 5.6 FPS or 384×768 video at 1.7 FPS. It is positioned for closed-loop autonomous driving simulation, helping automakers verify L3+ systems without going on the road.
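The two speed figures can be compared on equal terms by converting them to pixels generated per second. This is plain arithmetic on the quoted numbers; whether 256×512 means height by width or the reverse does not change the products.

```python
def pixel_throughput(height: int, width: int, fps: float) -> float:
    """Generated pixels per second at a given resolution and frame rate."""
    return height * width * fps


# The two single-RTX-5090 measurements quoted above.
low_res = pixel_throughput(256, 512, 5.6)
high_res = pixel_throughput(384, 768, 1.7)
area_ratio = (384 * 768) / (256 * 512)
fps_ratio = 5.6 / 1.7

print(f"256x512: {low_res:,.0f} px/s; 384x768: {high_res:,.0f} px/s")
print(f"frames are {area_ratio:.2f}x larger but generated {fps_ratio:.2f}x slower")
print(f"pixel throughput falls to {high_res / low_res:.0%} of the low-res figure")
```

By this measure the larger frames come out at about 68% of the smaller frames' pixel throughput, so in these two data points generation cost grows faster than frame area.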

Haomo.AI's DriveGPT was one of the earliest projects in China to adopt the "world model" label.

The "Snow Lake·Hai Ruo" released in April 2023 is a generative autonomous driving large model, building a 4D representation space by predicting the next frame. Behind it are 10 billion frames of internet images, 4.8 million 4D clips, and 87 million kilometers of assisted driving mileage.

Haomo follows a path similar to Tesla's world model and Wayve's GAIA-1, evolving the autonomous driving large model from "looking at pictures" to "looking at videos" and then to "predicting videos." It supplies capabilities for scenarios such as Great Wall's Wey brand and the Xiao Mo Tuo unmanned vehicles.

DeepRoute.ai released the DeepRoute IO 2.0 platform on August 26, 2025, equipped with a self-developed VLA model.

At the Beijing Auto Show in April 2026, DeepRoute went further, releasing its foundation model technology and Physical AI strategy and sharing commercial figures: more than 300,000 mass-produced vehicles carry its urban NOA solution, and over the past year vehicles with DeepRoute's active safety systems have logged more than 1.3 billion kilometers of real-road mileage and 44.8 million hours of user driving time.

DeepRoute does not name a separate world model, but the world model is the implicit core within DeepRoute IO 2.0's simulation and training system.

IV. Startups and Giants:

Two Maps, One City

And this chart of the giants' positions is the other map.

Two maps point to the same city: whoever enables AI to truly understand the physical world holds the key to the next era.

Startups' advantages are focus and speed.

They can bet on aggressive routes such as native world models, 3D spatial generation, or VLA physics engines, unencumbered by an existing business. But they lack the data, computing power, mass-production channels, and real-world closed loop needed to keep feeding a world model.

The giants' weaknesses are organizational inertia and the naming chaos that comes from parallel efforts across departments: Alibaba's three world model projects leave outsiders unsure whether they are even the same thing. But the giants have data, computing power, users, vehicles, and the engineering systems to run the models. Startups build "models"; giants build "systems."

The most dangerous moment for startups comes when giants turn world models from "research projects" into "business foundations." Huawei's Pangu large model serves ADS and robots, Tencent's HY-World serves gaming and industry, Li Auto's DrivingSphere serves driving iteration, SenseTime's Kaiwu is already mass-produced in vehicles, and Momenta's R7 already runs on more than 900,000 vehicles.

These are not PowerPoint slides from press conferences, but "capabilities" entering product assembly lines. For startups, the window for world models is narrowing. Future competition will quickly shift from "who can build a world model" to "whose world model can be affordably and effectively used by giants."

V. World Model Is Not a Trend,

It's an Upgrade of the Old War

The world model is not a new story.

It is the natural product of language large models, video generation models, autonomous driving end-to-end models, and robot VLA models converging in the physical world.

The influx of giants indicates that this has moved from a "tech geek toy" to "industrial infrastructure."

Alibaba, Tencent, ByteDance, Huawei, Baidu, Xiaomi, and SenseTime build bridges between the digital and physical worlds; NIO, Li Auto, Xpeng, Geely, BYD, and Great Wall extend the "bridge" to cars; Momenta, Horizon, Haomo, and DeepRoute lay the tracks beneath the bridge.

Startups stand at the end of the bridge holding more exquisite blueprints, but they must face the fact that the giants are mobilizing whole engineering teams.

In the coming year, the core question in the world model race will not be "who has built one," but "whose world model truly understands the world on humanity's behalf."

This article is from the WeChat official account IT桔子; author: Judy.


Related Q&A

Q: What is the core function or goal of a 'World Model' as described in the article across different Chinese tech companies?

A: The core goal is to enable machines to create an internal, dynamic, and predictable simulation of the world before taking real action. This 'data engine' compresses the real world to allow for infinite generation, experimentation, and error correction, thereby reducing reliance on massive real-world datasets.

Q: How does Alibaba's approach to world models differ from Tencent's, according to the article?

A: Alibaba's approach is comprehensive, targeting three distinct domains: the language world (Qwen-AgentWorld), the virtual world (HappyOyster), and the physical world for robots (Qwen-RobotWorld). Tencent's HY-World series focuses more on creating and editing 3D worlds, leveraging its strengths in gaming and social scenarios for applications like game level design, virtual production, and digital twins.

Q: What role do World Models play for automakers like NIO, Li Auto, and XPeng?

A: For automakers, World Models primarily serve as advanced 'driving schools' and 'testing grounds.' They are used to generate complex, long-tail driving scenarios (e.g., rain, snow, obstacles) for simulation, allowing autonomous driving systems to practice and refine their decision-making in a safe, virtual environment before deployment on real roads.

Q: Name one automotive supplier mentioned in the article that has developed a World Model and briefly describe its application or achievement.

A: Momenta developed the R7 reinforcement learning World Model, which is already mass-produced and deployed. It draws on over 12 billion kilometers of real driving data and is embedded into Momenta's end-to-end autonomous driving system, aiming for L4 standards. Momenta holds a 65% market share in third-party urban NOA in China and is pursuing an IPO billed as the 'first Physical AI stock.'

Q: What challenge do startups face in the World Model field compared to large companies, as outlined in the article?

A: Startups face significant challenges in data acquisition, computing power budgets, and access to mass-production channels. While they can be more focused and agile, they lack the massive, real-world, closed-loop application scenarios that large companies possess (e.g., vehicles, games, social platforms) to continuously feed and validate their models. The window of opportunity is narrowing as large firms integrate World Models into their core business systems.
