JD.com and Former OpenAI CTO Mira Murati Have Bet on the Same AI Track

marsbit | Published on 2026-06-24 | Last updated on 2026-06-24

Abstract

JD.com and Mira Murati's Thinking Machines Lab are converging on the same AI frontier: proactive visual-language interaction models. JD just open-sourced JoyAI-VL-Interaction, the world's first full-stack open-source model of its kind. Unlike traditional "turn-based" AI that waits for user prompts, this model actively analyzes continuous video streams, autonomously deciding when to respond, stay silent, or delegate complex tasks. It prioritizes vision as the primary driver for decision-making in physical-world scenarios like elderly fall detection, live sports commentary, or warehouse monitoring. The 8-billion-parameter model is designed for practical deployment, running on a single RTX 3090 GPU with sub-second latency. Its "full-stack" open-source release includes the model, inference system, and dataset, aiming to catalyze a developer ecosystem. JD's strategy is underpinned by its vast operational footprint in retail, logistics, and healthcare, which provides crucial real-world data for training. The move signals a broader shift in AI competition from screen-based Q&A to active participation in the physical world.

Imagine this scene:

An elderly person living alone slips and falls in the living room, and the pain prevents them from calling for help. At this moment, the smart device on their person or a home camera "sees" the abnormality. Without waiting for any voice command, the AI actively sends an alert and quickly contacts family or emergency services.

Or, you are watching an intense soccer match. At the moment a crucial goal is scored, before you can even think to rewind and ask a question, AI glasses automatically provide you with slow-motion analysis and tactical insights.

These scenarios are no longer fantasies of the future but real-world problems that JoyAI-VL-Interaction, the world's first full-stack open-source visual-language interaction model, just launched by JD.com, is attempting to solve.

Over the past two years, the capability boundaries of large language models have kept expanding, but the mainstream mode of interaction remains stuck in the "turn-based" logic of "user asks, model answers." That pattern is efficient, but it breaks down in many scenarios: important events often happen too fast for a user to ask a question, and in many settings there is no opportunity to give a voice command at all.

This year, a judgment is becoming an industry consensus: AI is moving from "predicting the next token" to "predicting the next physical state." This also means that AI must evolve from being a passive information processor to an active participant.

Right at this juncture, JD.com open-sourced JoyAI-VL-Interaction. This is the world's first full-stack open-source real-time visual-language interaction model, capable of autonomously judging when to respond, when to stay silent, and when to hand off complex tasks to backend models within continuous video streams.

What JoyAI-VL-Interaction aims to prove is: AI that truly enters the physical world should not always wait to be asked. It should learn to see, actively judge, and provide help at the right moment.

This is also the larger signal released by JD AI: from model capability to industrial application, AI competition is moving from on-screen Q&A to the real world.

Why Visual-Language Interaction?

In the real physical world, a vast amount of critical information occurs at moments when users don't have time to ask a question. This sense of "no time to react" is partly an experience issue, but more often, it's a capability boundary problem caused by the model paradigm.

The industry is not unaware of this limitation.

In the first half of 2026, real-time interaction became the hottest keyword in multimodal AI. The industry has broadly advanced along two paths: one is making turn-based dialogue faster, and the other is making voice calls more natural.

The former emphasizes low latency or arbitrary input/output, but its core remains "it answers only when you ask"; the latter allows the model to listen and speak simultaneously, be interrupted at any time, making the experience closer to a real human call, but the focus is still on voice scenarios.

The problem is that a large number of changes in the real world do not manifest as a sentence first. Fire, falls, approaching vehicles, changes in screen content, production line anomalies—these are all visuals that appear before language. If AI can only wait for someone to speak, it's hard to truly be "present."

The team that truly made the same judgment as JD.com at almost the same time is Thinking Machines Lab, founded by Mira Murati. On May 11, the company introduced the concept of "interaction models" and released research preview demos, arguing that the autonomous response paradigm of interaction models holds greater potential for human-AI collaboration than the traditional Q&A paradigm.

The fact that two teams converged on the same line of thinking at nearly the same time is itself a signal: scaling interactivity as an inherent capability of the model is a direction the industry cannot avoid in the coming years.

The difference is that JD.com placed visual-language at a more central position, treating speech as a pluggable I/O, and making visual-language the "first-class driving modality" for the model's autonomous decision-making.

In other words, from the moment the camera turns on, JoyAI-VL-Interaction continuously "watches" the visual changes in the physical world and autonomously decides whether to speak up, what to say, and whether to hand off tasks.

This is also where the imagination for visual interaction lies: it can be used in scenarios like elderly and childcare, assistance for the blind, AI glasses, event commentary, store inspections, warehouse logistics, and robot collaboration. Users don't need to first formulate a problem into a sentence; AI can capture the need from environmental changes.

Therefore, vision is not just another input method; it is an indispensable perceptual channel for AI to move toward "predicting the next physical state."

JD.com's technical report on JoyAI-VL-Interaction also reinforces this point. The report shows that in six real-world streaming scenarios, JoyAI-VL-Interaction achieved a win rate of 77.6% against leading domestic models and 87.9% against foreign models. In the surveillance and warning scenario, which most tests event capture ability, the win rate reached 100%. The report suggests the gap is not merely in answer quality, but in the ability to act at the right moment.

However, achieving proactive visual interaction is indeed more difficult.

Data acquisition for voice interaction is relatively straightforward. A large amount of voice command datasets allows models to learn when humans speak, how to interrupt, and how to respond. The data needed for visual interaction is completely different. The model must learn, from a continuous stream of changing visuals, what signals merit a response and what signals should be met with silence.

A deeper barrier is the ability to define scenarios. Voice interaction has a natural trigger boundary: the moment the user starts to speak marks the start of the interaction. Visual interaction has no clear start or end; the model must judge the boundaries itself within an unbounded stream of information.

This is also where JD.com's uniqueness lies: the company does not search for scenarios from an abstract laboratory; it naturally operates within real business networks spanning retail, logistics, health, industry, and more.

This means JD AI is not facing a single chat interface but a massive number of real-world tasks: how goods move, how equipment coordinates, how robots cooperate with humans, how anomalies are detected in advance. Models can learn from real needs and iterate based on real feedback.

Whatever the trade-offs between technical routes, the interaction form of future AGI will inevitably be proactive intelligence. Intelligent agents must possess a complete loop of environmental perception, autonomous decision-making, and real-time response. Many companies are therefore not unwilling to build visual interaction models; they simply lack the soil in which to nurture them. This is also why capital and computing power surged into the voice interaction track first.

Thus, JD.com's choice to start from vision is not merely a technical route selection; it is also dictated by its strategic position. Compared to many LLM players, JD.com is closer to the operational front lines of the physical world and also has a greater need for AI that can actively perceive and respond in real-time.

To make this day come sooner, someone needs to set out earlier.

Lightweight, Open-Source, Deployable

What does being the world's first full-stack open-source model mean?

Redefining the interaction paradigm sounds grand, but when it comes to real-world applications, the first hurdle is quite simple: AI cannot always disturb people, nor can it remain silent when a reminder is needed.

People typically expect AI to be as talkative as possible, but in real-time visual interaction scenarios, a model that constantly interrupts is not smart. The truly valuable capability is actively appearing at critical moments and staying quiet during irrelevant times.

Therefore, JoyAI-VL-Interaction trains "silence" as an ability as well. The model needs to master three layers of judgment: in what scenarios it should proactively respond, in what scenarios it should remain silent, and in what scenarios it should delegate tasks out to other models.
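
The three layers of judgment described above can be pictured as a per-frame, three-way decision. The following is only a minimal illustrative sketch, not JD.com's implementation: the `FrameSignal` scores, the thresholds, and the `decide` function are all hypothetical names invented here, and a real model would make this judgment internally rather than through hand-set rules.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESPOND = "respond"    # speak up directly
    SILENT = "silent"      # stay quiet
    DELEGATE = "delegate"  # hand the task to a backend model


@dataclass
class FrameSignal:
    # Hypothetical per-frame scores in [0, 1]; assumed for illustration only.
    urgency: float     # how strongly the scene calls for attention
    complexity: float  # how much deep reasoning a reply would need


def decide(signal: FrameSignal,
           urgency_threshold: float = 0.6,
           complexity_threshold: float = 0.7) -> Action:
    """Toy policy: silence by default, delegate only when the task is hard."""
    if signal.urgency < urgency_threshold:
        return Action.SILENT
    if signal.complexity >= complexity_threshold:
        return Action.DELEGATE
    return Action.RESPOND


# Three made-up frames: a quiet scene, a simple alert, a hard problem.
stream = [FrameSignal(0.1, 0.2), FrameSignal(0.9, 0.3), FrameSignal(0.8, 0.9)]
actions = [decide(s) for s in stream]
print([a.value for a in actions])  # ['silent', 'respond', 'delegate']
```

Making silence the default branch reflects the point in the text: staying quiet is treated as a decision in its own right, not as the absence of one.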

This set of capabilities is of limited value if it stays only in research papers. JD.com's emphasis on "full-stack open-source" is key because it opens up the model, inference system, and application building path together, allowing developers to truly run, modify, and use it.

JD.com has chosen an engineering route that facilitates broader diffusion: an 8B parameter model, deployable on a single RTX 3090 graphics card. At this parameter scale, individual developers can run it, consumer-grade hardware can support it, and edge devices can implement it.
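
A rough back-of-the-envelope calculation shows why an 8B model is plausible on a single RTX 3090. The sketch below counts weight memory only and rests on assumptions: the article does not state the numeric precision used, the 24 GB figure is the card's published VRAM size (treated here as 24 GiB), and activations, KV cache, and the vision encoder need additional memory that is not modeled.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


PARAMS = 8e9            # 8B parameters, as stated in the article
RTX_3090_VRAM_GIB = 24  # published VRAM size of the card

# Precisions are assumptions; the source does not say which one is used.
for name, nbytes in [("fp16/bf16", 2), ("int8", 1)]:
    gib = weight_memory_gib(PARAMS, nbytes)
    headroom = RTX_3090_VRAM_GIB - gib
    print(f"{name}: {gib:.1f} GiB of weights, {headroom:.1f} GiB of headroom")
```

At 16-bit precision the weights alone take roughly 14.9 GiB, which leaves some room on a 24 GB card for the rest of the inference workload.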

For real-time visual interaction, this lightweight approach does not mean reduced capability, but rather clearer division of labor.

JoyAI-VL-Interaction acts more like a front-end interaction layer, responsible for seeing the environment, judging timing, and completing brief communication. When encountering complex tasks requiring deep reasoning, it automatically delegates them to backend agents selected by the user, such as OpenClaw, Codex, or Claude Code. Therefore, an 8B model is sufficient.

For example, the model can first tell the user, "Let me think about that," then hand the difficult problem to the backend while itself remaining present; after the backend returns a result, it can synchronize the answer to the user. During this process, it can also continue helping the user with other immediate interactions.
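
The "acknowledge, delegate, stay present" pattern in this example maps naturally onto asynchronous code. The sketch below is a hedged illustration using Python's standard `asyncio` library; `backend_agent` is a simulated stand-in, not a real client for any of the agents named above, and the events and delays are invented.

```python
import asyncio


async def backend_agent(question: str) -> str:
    # Stand-in for a slow backend agent; the delay is simulated.
    await asyncio.sleep(0.2)
    return f"answer to: {question}"


async def interaction_loop() -> list[str]:
    spoken: list[str] = []

    # 1. Acknowledge immediately, then hand the hard problem off.
    spoken.append("Let me think about that.")
    task = asyncio.create_task(backend_agent("hard question"))

    # 2. Stay present: keep handling lightweight events while the backend works.
    for event in ["user waves", "door opens"]:
        await asyncio.sleep(0.05)
        spoken.append(f"noticed: {event}")

    # 3. Relay the backend's result once it is ready.
    spoken.append(await task)
    return spoken


transcript = asyncio.run(interaction_loop())
print(transcript)
```

The key design point is that the front-end loop never blocks on the backend call, so short interactions continue while the long-running task is in flight.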

JD.com has also adopted a lightweight design in the underlying system: through video encoding, long-range memory, and context compression, the model can continuously watch long video streams at a lower cost and control end-to-end latency to sub-second levels. For the average reader, the focus is not on these technical terms, but the result: AI can stay in real-world scenarios for longer and with a lower barrier to entry.
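
To give a feel for how long-range memory and context compression keep the cost of watching a long stream bounded, here is a toy sketch. It is not the mechanism from JD.com's technical report, which the article does not describe: the `RollingContext` class, its window sizes, and the endpoint-only "compression" are all assumptions made for illustration.

```python
from collections import deque


class RollingContext:
    """Keep recent frames in full; fold older frames into short summaries."""

    def __init__(self, recent_size: int = 4, chunk: int = 2, memory_size: int = 2):
        self.recent: deque[str] = deque()
        # Bounded long-range memory: the oldest summaries are dropped when full.
        self.memory: deque[str] = deque(maxlen=memory_size)
        self.recent_size = recent_size
        self.chunk = chunk

    def add(self, frame: str) -> None:
        self.recent.append(frame)
        if len(self.recent) > self.recent_size:
            # Toy "compression": replace a chunk of old frames by its endpoints.
            old = [self.recent.popleft() for _ in range(self.chunk)]
            self.memory.append(f"{old[0]}..{old[-1]}")

    def context(self) -> list[str]:
        return list(self.memory) + list(self.recent)


ctx = RollingContext()
for i in range(10):
    ctx.add(f"f{i}")

print(ctx.context())  # ['f2..f3', 'f4..f5', 'f6', 'f7', 'f8', 'f9']
```

However many frames arrive, the context never exceeds a fixed number of entries, which is the property that makes sub-second latency on long streams conceivable at all.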

A cost-effective, deployable choice also directly leads to JD.com's open-source strategy. Only when the model is sufficiently lightweight, the system sufficiently complete, and the deployment barrier sufficiently low can real-time visual interaction potentially evolve from experiments by a few teams to an application ecosystem explored by more developers and enterprises together.

JD.com has already open-sourced this inference system, with a clear goal: to enable anyone with an RTX 3090 or higher graphics card and a camera to quickly set up their own real-time visual interaction application.

JoyAI-VL-Interaction has received day-0 support from vLLM-Omni and has been natively merged into the vLLM-Omni mainline.

Bringing AI Back to the Physical World

The purpose of open-sourcing is to hand the work of imagining applications over to a larger market, because the value of a technological breakthrough must ultimately be tested by the real world.

The first batch of application ideas for JoyAI-VL-Interaction is already quite intuitive: during live sports broadcasts, AI can automatically provide commentary at the moment of a key goal or last-minute play; in stock monitoring, it can continuously watch screen changes and alert to anomalies; in home care, it can actively warn when an elderly person falls or a child approaches a dangerous area; paired with AI glasses, it can help users recognize roads, products, screens, and surroundings; when assisting the blind, it can convert visual information into real-time assistance.

For JD.com, an even greater expectation is its application in robotics: a model that understands when to speak, when to be silent, and when to ask a backend system for help can make robots more efficient and closer to the "tactful" intelligent assistants people expect.

The fundamental reason JD.com dares to "stir" this field at this point is that it holds physical world data assets that other LLM players lack.

Placed within the industry coordinates of 2026, the weight of physical world data assets is particularly significant.

2026 has been dubbed the "Year One of Embodied Intelligence Data" by the industry. Within this grand backdrop, a sharp contradiction exists: high-quality physical interaction data is extremely scarce, far from meeting the needs of large-scale training. The bottleneck for algorithmic iteration is shifting comprehensively from the model side to the data side.

At this point in time, JD.com announced its plan to accumulate 10 million hours of high-quality real-world scene video data within two years, mobilizing 600,000 people to participate in collection.

JD.com has over 3,000 real business scenarios covering retail, logistics, health, industry, and more. This year, it also innovated a community grid collection model in Suqian, deploying its self-developed JoyEgoCam head-mounted terminals in batches and mobilizing surrounding small and medium-sized enterprises and residents to collect data in real work scenarios.

The deployment speed is rapid. In March, JD.com announced the completion of the world's first embodied intelligence data collection center in Suqian. In April, it released the industry's first embodied data infrastructure covering the entire chain of collection, storage, labeling, training, evaluation, simulation, and testing. In May, JoyEgoCam achieved mass production, enabling continuous first-person perspective data collection.

This data is the scarcest fuel for training embodied models and visual interaction models. As embodied data joins the training, the value of JoyAI-VL-Interaction will extend beyond "a model that can actively see" into concrete physical settings such as robots, unmanned vehicles, warehouses, stores, and homes.

JoyAI-Echo, also open-sourced by JD.com on June 3rd, plays a key role between models and applications. Echo excels at real-time generation from long videos, while Interaction excels at real-time understanding and interaction. Releasing two models within a month signals that JD.com has connected both the input and output ends of video multimodality and is treating AI's advance into the physical world as a long-term commitment.

At the 618 kickoff press conference this year, JD.com stated its ambition to become "the world's largest physical world operation center."

In the era of human-computer interaction, the industry is increasingly focusing on how AI understands the physical world. JD.com's problem-solving logic differs from that of most LLM players: this company already operates within the physical world.

Warehousing, delivery, retail, health, and industry are all training grounds and proving grounds for AI and embodied intelligence. Within JD Logistics alone, there are plans to deploy 3 million robots, 1 million unmanned vehicles, and 100,000 drones over the next five years. This hardware will also become a platform for JoyAI-VL-Interaction to demonstrate its utility.

Whether voice or vision, interaction models are essentially about connecting the physical and digital worlds, understanding the physical world, and orchestrating the digital world.

Open-sourcing is the first window JD.com opens outward. In this track where demand drives technology, by releasing the model, training data, and complete system together, JD.com is betting on a longer-term vision: transforming proactive interaction from a judgment by a few teams into a main channel for AI's advancement into the physical world.

You are welcome to launch the service with one click in vLLM-Omni, or start it locally with one click from the repository:

Code Repository: https://github.com/jd-opensource/JoyAI-VL-Interaction

Model Hub: https://huggingface.co/jdopensource/JoyAI-VL-Interaction-Preview

Dataset Hub: https://huggingface.co/datasets/jdopensource/JoyAI-VL-Interaction

Technical Report: https://huggingface.co/papers/2606.14777

Related Questions

Q: What is the name and key innovation of the AI model recently open-sourced by JD.com?

A: The model is named JoyAI-VL-Interaction. Its key innovation is being the world's first full-stack open-source real-time visual-language interaction model, capable of autonomously deciding when to respond, remain silent, or delegate tasks while processing continuous video streams.

Q: According to the article, what is the main limitation of the current dominant 'turn-based' AI interaction paradigm?

A: The main limitation is that it requires a user to actively ask a question before the AI responds ('user questions, model answers'). This breaks down for fast-occurring events where there is no time to formulate a question and for scenarios where voice commands are impossible.

Q: How does JD.com's JoyAI-VL-Interaction differ in its core approach compared to the concept proposed by Mira Murati's Thinking Machines Lab?

A: While both converged on the concept of 'interaction models,' JD.com's model places visual-language at its core as the 'first-class driving modality' for autonomous decision-making and treats speech as a pluggable I/O.

Q: What are the practical deployment advantages of the JoyAI-VL-Interaction model as highlighted in the article?

A: The model is designed for practical deployment with an 8B parameter size, allowing it to run on a single NVIDIA RTX 3090 GPU. It features a lightweight system with video encoding, long-range memory, and context compression for low-latency, long-duration operation on video streams, making it accessible for developers and edge devices.

Q: What long-term strategic asset does JD.com possess that supports its push into AI for the physical world, as mentioned in the article?

A: JD.com possesses vast 'physical world data assets' from its real-world operations across retail, logistics, health, and industry. It is actively building a massive dataset, aiming to collect 10 million hours of high-quality real-scene video data, which is crucial for training embodied and visual interaction AI models.
