A First: A VLA for Dexterous Manipulation Pretrained Purely on Human Video, Deployable with Minimal Fine-Tuning Data

marsbit | Published 2026-06-08 | Updated 2026-06-08

Introduction

For the first time, a Vision-Language-Action (VLA) model for dexterous manipulation pretrained purely on human video needs only a small amount of fine-tuning data to be deployed successfully in the real world. Achieving human-level dexterous manipulation remains a core challenge in robotics. While multi-fingered hands offer hardware potential, VLA models lag behind due to the high cost of collecting diverse, high-quality robot data. A novel framework, VITRA, developed by Microsoft Research Asia and Tsinghua University, addresses this by automatically transforming massive, unlabeled real-world human activity videos into a structured V-L-A training dataset. Key innovations include precise 3D hand motion annotation from monocular video, atomic action segmentation based on hand-speed minima, and automated instruction generation using VLMs combined with 3D trajectory visualization. This process created a massive dataset of 1 million clips. Pretrained exclusively on this human video data, the VLA model (combining a VLM backbone with a Diffusion Transformer action expert) demonstrates strong zero-shot hand motion prediction in unseen environments. Crucially, it requires minimal fine-tuning (~1.2k demonstrations) on real robot data to perform dexterous manipulation tasks such as grasping, placing, pouring, and sweeping with high success rates on hardware such as the Realman robot with the XHAND1 dexterous hand. The model shows exceptional generalization to novel obje...

Achieving human-level dexterous manipulation capability has long been a core challenge in the field of robotics.

Although multi-fingered dexterous hands offer hardware potential comparable to the human hand, high-quality robot action data is expensive to acquire. As a result, existing Vision-Language-Action (VLA) models lag far behind large language models (LLMs) and vision-language models (VLMs) in data scale and diversity, and struggle to meet the demands of complex real-world tasks.

A recent research paper titled "Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos" from Microsoft Research Asia (MSRA) in collaboration with Tsinghua University addresses this critical issue by proposing an innovative pre-training framework called VITRA.

The core contribution of this research lies in proposing a fully automated solution that converts massive amounts of unlabeled real-world human activity videos into data perfectly aligned with the existing V-L-A training data format for robots.

By extracting 3D hand motion trajectories from videos, performing atomic-level action segmentation, and automatically generating language instructions, the research team constructed a large-scale hand V-L-A dataset containing 1 million clips and 26 million frames.

After pre-training solely on human video data, the model demonstrated powerful zero-shot hand action prediction capabilities in completely unseen real-world environments.

With only a small amount of fine-tuning using real robot data, it achieved high success rates in dexterous manipulation on real robots and exhibited strong generalization ability to new objects and environments.

More details follow below.

Bridging the Gap from Human Videos to Robot Data

The central challenge of the paper is how to overcome the vast difference between unstructured human videos and structured robot data, thereby extracting high-quality action labels and language instructions usable for VLA model pre-training.

This research built a complete system consisting of three core technologies, achieving seamless transformation from raw video to V-L-A data.

3D Motion Annotation: Accurately Recovering Hand and Camera Trajectories

Recovering precise 3D hand motion from monocular, uncalibrated, and potentially moving camera videos is an extremely challenging task.

This research proposes a monocular camera and hand pose tracking method based on the latest 3D vision technology:

First, it determines the camera state via background optical flow and estimates camera intrinsics.

Subsequently, it tracks camera pose using a visual SLAM method and a depth estimation model, and extracts the camera-space 3D hand pose per frame (including wrist 6D pose and full joint angles) using a hand reconstruction model.

Finally, by combining this information, it obtains the 3D hand motion trajectory in world space.
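The final combination step can be illustrated with a minimal sketch. It assumes that the SLAM stage yields a per-frame camera-to-world transform and that the hand reconstruction model yields a per-frame wrist pose in camera space, both as 4x4 homogeneous matrices. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose

def hand_to_world(T_world_cam, T_cam_hand):
    """Compose per-frame transforms: (world <- camera) @ (camera <- hand)."""
    return np.einsum("nij,njk->nik", T_world_cam, T_cam_hand)

# Toy data (assumed): the camera moves 1 m along world x between two frames,
# while the wrist stays 0.5 m in front of the camera.
T_world_cam = np.stack([
    make_pose(np.eye(3), [0.0, 0.0, 0.0]),
    make_pose(np.eye(3), [1.0, 0.0, 0.0]),
])
T_cam_hand = np.stack([make_pose(np.eye(3), [0.0, 0.0, 0.5])] * 2)

T_world_hand = hand_to_world(T_world_cam, T_cam_hand)
wrist_positions = T_world_hand[:, :3, 3]  # world-space wrist trajectory
print(wrist_positions)
```

In this toy case the wrist appears static in camera space, yet its world-space trajectory moves with the camera, which is why camera motion has to be recovered before hand motion can be used as an action label.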

This method not only provides high-precision action labels but also lays the foundation for subsequent action segmentation and instruction annotation.

Atomic-Level Action Segmentation: Natural Segmentation Based on Velocity Minima

Existing robot V-L-A data typically consists of simple, short-horizon atomic-level tasks. Accurately segmenting these atomic actions from long videos is a difficult problem.

Inspired by the natural rhythm of human movements, the research team proposed a simple yet efficient segmentation algorithm: segmenting based on the minima of hand movement speed in 3D space.

When transitioning between actions, the human hand typically slows down and then speeds up again, so speed minima often mark the boundary between two actions.

By detecting the speed minima of the 3D wrist trajectory in world space, this method can efficiently segment long videos into short clips containing a single atomic action, without requiring any additional manual annotation or model inference.
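The idea can be sketched in a few lines. This is an illustrative version only: it uses raw frame-to-frame speed and a plain local-minimum test, whereas a practical pipeline would likely smooth the speed signal and enforce a minimum clip length. Those details, and the function name, are assumptions rather than the paper's implementation.

```python
import numpy as np

def segment_by_speed_minima(wrist_xyz, fps):
    """Split a world-space wrist trajectory into clips at local speed minima.

    wrist_xyz: (num_frames, 3) array of wrist positions.
    Returns a list of half-open (start_frame, end_frame) ranges.
    """
    wrist_xyz = np.asarray(wrist_xyz, dtype=float)
    num_frames = len(wrist_xyz)
    # speed[i] is the speed over the interval from frame i to frame i + 1.
    speed = np.linalg.norm(np.diff(wrist_xyz, axis=0), axis=1) * fps
    interior = np.arange(1, len(speed) - 1)
    is_min = (speed[interior] < speed[interior - 1]) & (
        speed[interior] <= speed[interior + 1]
    )
    # Place the cut at the frame that ends the slowest interval.
    cut_frames = interior[is_min] + 1
    bounds = [0, *cut_frames.tolist(), num_frames]
    return list(zip(bounds[:-1], bounds[1:]))

# Toy trajectory (assumed): move, pause briefly, move again along x.
xs = [0.0, 1.0, 2.0, 2.1, 3.1, 4.1]
trajectory = np.array([[x, 0.0, 0.0] for x in xs])
segments = segment_by_speed_minima(trajectory, fps=1.0)
print(segments)
```

Because the cut points come directly from the recovered trajectory, no learned model or manual label is involved at this stage.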

Instruction Annotation: Precise Action Description Combining 3D Trajectories

To generate accurate language instructions for the segmented video clips, the research team cleverly combined a Vision-Language Model (VLM) with 3D hand trajectories.

For each video clip, the system uniformly samples 8 frames and projects and overlays the palm's 3D trajectory onto these images.

Then, these images with trajectory highlighting are input into GPT-4, prompting it to describe the specified hand's action in an imperative sentence form, combining image content and trajectory information.

Experiments showed that providing atomic-level video clips overlaid with 3D hand trajectories significantly improves the accuracy of the GPT-generated action descriptions.
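The frame sampling and trajectory overlay described above can be sketched as follows. The sketch assumes a simple pinhole camera and assumes the palm trajectory has already been expressed in the coordinates of the frame it is drawn on. The helper names and the intrinsics are hypothetical, and the drawing and VLM prompting steps are omitted.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=8):
    """Pick num_samples frame indices spread uniformly over the clip."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def project_points(points_cam, fx, fy, cx, cy):
    """Project camera-space 3D points (N, 3) to pixel coordinates (N, 2)."""
    points_cam = np.asarray(points_cam, dtype=float)
    x, y, z = points_cam.T
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

# Toy clip of 29 frames and a two-point palm trajectory (assumed values).
indices = sample_frame_indices(29)
palm_cam = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.5]]
pixels = project_points(palm_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(indices)
print(pixels)
```

The resulting pixel coordinates are what would be drawn as a polyline on each sampled frame before the frames are passed to the VLM.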

Achieving Powerful Zero-Shot Prediction and Real-World Generalization

Based on the automatically constructed large-scale human hand V-L-A dataset, the research team designed and trained a VLA model specifically tailored for dexterous manipulation.

1. Model Architecture: Combining VLM and Diffusion Action Experts

This VLA model consists of a VLM backbone network (PaliGemma-2) and a diffusion action expert (Diffusion Transformer, DiT).

The VLM receives visual observations, language instructions, and camera field-of-view (FoV) information, outputting a "Cognition Feature".

The diffusion action expert receives this cognition feature, the current hand state, and a noisy action block with masking, iteratively denoising to predict future hand action sequences.

To handle fast-moving human hand actions and adapt to short-clip data, the model employs a Causal Attention mechanism for action denoising, ensuring each action step's prediction depends only on previous actions, effectively mitigating negative impacts from zero-padding.
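A minimal sketch shows why a causal mask limits the influence of zero-padding. It covers only attention among action steps. How the action tokens attend to the cognition feature and the hand state is not described in the article, so it is left out, and the function names are hypothetical.

```python
import numpy as np

def causal_mask(chunk_len):
    """allowed[i, j] is True when action step i may attend to step j (j <= i)."""
    return np.tril(np.ones((chunk_len, chunk_len), dtype=bool))

def masked_softmax(scores, allowed):
    """Softmax over the last axis, with disallowed positions given zero weight."""
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy chunk (assumed): 4 action steps, of which only the first 2 are real;
# the last 2 are zero-padding because the clip ended early.
chunk_len, valid_len = 4, 2
rng = np.random.default_rng(0)
scores = rng.normal(size=(chunk_len, chunk_len))
weights = masked_softmax(scores, causal_mask(chunk_len))

# The real steps put zero attention weight on the padded steps that follow them.
print(weights[:valid_len, valid_len:])
```

Since padding always sits at the end of a short clip, every real step only looks backward at other real steps, so the padded tail cannot contaminate their predictions.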

2. Zero-Shot Hand Action Prediction: Demonstrating Remarkable Ability in Unseen Environments

In completely unseen real-life environments, the pre-trained model demonstrated powerful zero-shot hand action prediction capabilities.

In evaluations for grasping tasks and general action prediction tasks, this model significantly outperformed models trained on data collected in lab environments (like EgoDex), and also outperformed models trained using raw human-annotated data.

This fully demonstrates that pre-training with massive, diverse real-life videos can greatly enhance the model's generalization ability for complex environments and unknown objects.

3. Real-Robot Dexterous Manipulation: Efficient Deployment with Minimal Fine-Tuning Data

To deploy on real robots, the research team aligned the human hand's action space with that of the robot dexterous hand (e.g., Realman robot equipped with StarXHand1).

With only a small amount (about 1.2K instances) of fine-tuning using real robot teleoperation data on the pre-trained model, it could execute various dexterous manipulation tasks in the real world, including grasping, placing, pouring, and sweeping.

Experimental results show that compared to models not pre-trained on human VLA data or pre-trained on other datasets (like OXE, EgoDex), this method achieved significant improvement in task success rate, especially demonstrating remarkable robustness when facing unseen objects and backgrounds.

The Hardware Core Supporting VITRA's Real-World Deployment

VITRA's strong generalization on real robots rests not only on algorithmic innovation but also on the underlying hardware: StarXHand1, a fully direct-drive five-fingered dexterous hand developed by StarX Robotics and billed as a first of its kind in China.

This framework forms a perfect "software-hardware synergy" with the hardware characteristics of StarXHand1, demonstrating irreplaceable deployment advantages in practical application scenarios.

High-Precision URDF and Seamless Connection to Human Hand Action Space

The core breakthrough of the VITRA framework lies in aligning the human hand action space with the robot dexterous hand's action space.

StarXHand1 comes with an official, extremely high-precision URDF model, which not only accurately describes kinematic and dynamic parameters but also perfectly maps the spatial distribution of human hand joints.

This "digital twin"-level model support enables VITRA to precisely map human joint angles to the corresponding joints of StarXHand1 during the fine-tuning phase, thereby significantly reducing the reality gap from human videos to real hardware and ensuring the efficient deployment of pre-trained strategies on real hardware.
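As a rough illustration of what such a joint-level mapping can look like, the sketch below selects the human joint angles that correspond to robot joints and clips them to joint limits of the kind a URDF provides. The index map, the limits, and the function name are hypothetical placeholders. They are not taken from the StarXHand1 URDF or from the paper.

```python
import numpy as np

def retarget_joint_angles(human_angles, joint_map, lower, upper):
    """Map human joint angles to robot joints and clip to the joint limits.

    human_angles: (..., num_human_joints) angles in radians.
    joint_map: for each robot joint, the index of the matching human joint.
    lower, upper: per-robot-joint limits in radians (as listed in a URDF).
    """
    human_angles = np.asarray(human_angles, dtype=float)
    robot_angles = human_angles[..., joint_map]
    return np.clip(robot_angles, lower, upper)

# Toy example (assumed values): 4 human joint angles mapped to 3 robot joints.
human_angles = [0.2, 1.9, -0.3, 0.7]
joint_map = [1, 0, 3]
lower = np.array([0.0, 0.0, 0.0])
upper = np.array([1.5, 1.5, 1.5])
robot_angles = retarget_joint_angles(human_angles, joint_map, lower, upper)
print(robot_angles)
```

The closer the robot hand's joint layout is to the human hand, the less this mapping has to distort the pretrained action space, which is the point the article makes about the URDF.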

Fully Direct-Drive Architecture and High-Frequency Response: Perfect Execution of Complex Dexterous Operations

When performing complex dexterous operations such as pouring and sweeping, robots require extremely high dynamic response capability.

The fully direct-drive motor architecture adopted by StarXHand1 provides the most ideal hardware foundation for this algorithm.

The fully direct-drive design fundamentally eliminates the significant friction, hysteresis, and nonlinear interference brought by traditional reducers, endowing the dexterous hand with super-sensitive dynamic response capabilities. This enables StarXHand1 to instantly and precisely execute the action commands output by the VITRA model, safely manipulating various unknown objects.

Rich Sensor Array: Reserving Space for Future Multimodal Perception

Although the current VITRA model primarily relies on visual input, the rich sensor array equipped on StarXHand1 (such as high-resolution tactile arrays) reserves vast space for future multimodal perception.

Combined with StarXHand1's powerful hardware perception capabilities, future VLA models are expected to further integrate tactile feedback and handle more delicate and complex finger-gaiting tasks.

The Scaling Law of Data Size

This research also explored the impact of pre-training data scale on model performance.

Experiments found that as the amount of pre-training data increased, the model's error in zero-shot hand action prediction tasks steadily decreased, and its success rate in real robot manipulation tasks continuously rose.

This clear scaling behavior indicates that by further expanding the scale of human video data, the performance of VLA models is expected to continuously improve.

This achievement marks a key breakthrough in utilizing unstructured human videos for robotic VLA model pre-training.

By providing a fully automated data conversion solution, this research significantly lowers the barrier to acquiring high-quality robot training data, paving the way for the application of multi-fingered dexterous hands in broader, more complex real-world scenarios, and laying a solid foundation for moving towards truly generalized embodied intelligence.

Paper link: https://arxiv.org/abs/2510.21571

This article is from WeChat Official Account "QbitAI", author: VITRA Team

Related Questions

Q: What is the main contribution of the VITRA framework introduced in the article?

A: The main contribution of VITRA is a fully automated framework that converts massive amounts of unlabeled real-world human activity videos into data perfectly aligned with robot VLA training data formats. It creates a large-scale hand V-L-A dataset (1M clips, 26M frames) for pre-training, enabling models to achieve strong zero-shot prediction and, after minimal robot data fine-tuning, high success in real robot dexterous manipulation.

Q: What are the three core technical components for converting human videos into V-L-A data in the VITRA framework?

A: The three core technical components are: 1) 3D motion annotation for recovering precise hand and camera trajectories from monocular videos, 2) Atomic-level action segmentation based on velocity minima in 3D hand trajectories, and 3) Instruction annotation using VLM (like GPT-4) prompted with images and overlaid 3D hand trajectory to generate accurate action descriptions.

Q: How does the pre-trained VLA model achieve zero-shot hand action prediction in unseen environments?

A: The VLA model, pre-trained on the large-scale human video dataset, shows strong zero-shot hand action prediction capabilities in completely unseen real-world environments. Its architecture combines a VLM backbone for processing visual input and instructions to output a cognition feature, and a diffusion action expert to predict future hand action sequences through iterative denoising, outperforming models trained on lab data or human-annotated datasets.

Q: How is the VITRA framework deployed on a real robot for dexterous manipulation tasks?

A: To deploy on a real robot, the VITRA framework aligns the human hand action space with the robot hand's action space. After pre-training on human videos, the model is fine-tuned using a small amount of real robot teleoperation data (e.g., ~1.2K demos). This fine-tuned model can then execute various dexterous tasks like grasping, placing, pouring, and sweeping on the real robot with high success rates and strong generalization to new objects and backgrounds.

Q: What role does the Xingdong XHAND1 dexterous hand hardware play in the successful deployment of VITRA?

A: The Xingdong XHAND1 dexterous hand provides crucial hardware support for VITRA's deployment. Its high-precision URDF model enables seamless alignment between human and robot action spaces. Its full direct-drive architecture offers high-frequency dynamic response, perfectly executing complex VITRA commands. Its rich sensor array also leaves space for future multi-modal perception integration, forming a powerful 'software-hardware synergy' for real-world application.
