For the First Time, Pure Human Video Pretrained VLA for Dexterous Manipulation: Deployable with Minimal Fine-Tuning Data

marsbit | Published on 2026-06-08 | Last updated on 2026-06-08

Abstract

For the first time, a purely human-video-pretrained Vision-Language-Action (VLA) model for dexterous manipulation requires only a small amount of data for fine-tuning to achieve successful real-world deployment. Achieving human-level dexterous manipulation remains a core challenge in robotics. While multi-fingered hands offer hardware potential, VLA models lag behind due to the high cost of collecting diverse, high-quality robot data. A novel framework, VITRA, developed by Microsoft Research Asia and Tsinghua University, addresses this by automatically transforming massive, unlabeled real-world human activity videos into a structured V-L-A training dataset. Key innovations include precise 3D hand motion annotation from monocular video, atomic action segmentation based on hand-speed minima, and automated instruction generation using VLMs combined with 3D trajectory visualization. This process created a massive dataset of 1 million clips. Pretrained exclusively on this human video data, the VLA model (combining a VLM backbone with a Diffusion Transformer action expert) demonstrates strong zero-shot hand motion prediction in unseen environments. Crucially, it requires minimal fine-tuning (~1.2k demonstrations) on real robot data to achieve high-success-rate dexterous manipulation tasks like grasping, placing, pouring, and sweeping on hardware like the Realman robot with the XHAND1 dexterous hand. The model shows exceptional generalization to novel obje...

Achieving human-level dexterous manipulation capability has long been a core challenge in the field of robotics.

Although multi-fingered dexterous hands have hardware potential comparable to the human hand, high-quality robot action data is expensive to acquire. As a result, existing Vision-Language-Action (VLA) models lag far behind large language models (LLMs) and vision-language models (VLMs) in data scale and diversity, and struggle to meet the demands of complex real-world tasks.

A recent research paper titled "Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos" from Microsoft Research Asia (MSRA) in collaboration with Tsinghua University addresses this critical issue by proposing an innovative pre-training framework called VITRA.

The core contribution of this research is a fully automated pipeline that converts massive amounts of unlabeled real-world human activity videos into data that matches the V-L-A training data format already used for robots.

By extracting 3D hand motion trajectories from videos, performing atomic-level action segmentation, and automatically generating language instructions, the research team constructed a large-scale hand V-L-A dataset containing 1 million clips and 26 million frames.

After pre-training solely on human video data, the model demonstrated powerful zero-shot hand action prediction capabilities in completely unseen real-world environments.

With only a small amount of fine-tuning using real robot data, it achieved high success rates in dexterous manipulation on real robots and exhibited strong generalization ability to new objects and environments.

More details follow below.

Bridging the Gap from Human Videos to Robot Data

The central challenge of the paper is how to overcome the vast difference between unstructured human videos and structured robot data, thereby extracting high-quality action labels and language instructions usable for VLA model pre-training.

This research built a complete system consisting of three core technologies, achieving seamless transformation from raw video to V-L-A data.

3D Motion Annotation: Accurately Recovering Hand and Camera Trajectories

Recovering precise 3D hand motion from monocular, uncalibrated, and potentially moving camera videos is an extremely challenging task.

This research proposes a monocular camera and hand pose tracking method based on the latest 3D vision technology:

First, it determines the camera state via background optical flow and estimates camera intrinsics.

Subsequently, it tracks camera pose using a visual SLAM method and a depth estimation model, and extracts the camera-space 3D hand pose per frame (including wrist 6D pose and full joint angles) using a hand reconstruction model.

Finally, by combining this information, it obtains the 3D hand motion trajectory in world space.

This method not only provides high-precision action labels but also lays the foundation for subsequent action segmentation and instruction annotation.
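The final combination step can be illustrated with a minimal sketch, assuming each frame already has a camera-to-world pose (rotation `R_wc`, translation `t_wc`) and a camera-space wrist position. The function names and the toy input below are illustrative assumptions, not the paper's implementation, and the sketch handles only the wrist position, not the full 6D pose or joint angles.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major


def camera_to_world(p_cam: Vec3, R_wc: Mat3, t_wc: Vec3) -> Vec3:
    """Apply the rigid transform p_world = R_wc @ p_cam + t_wc."""
    x, y, z = (
        sum(R_wc[i][j] * p_cam[j] for j in range(3)) + t_wc[i]
        for i in range(3)
    )
    return (x, y, z)


def wrist_trajectory_world(
    wrists_cam: Sequence[Vec3],
    cam_poses: Sequence[Tuple[Mat3, Vec3]],
) -> List[Vec3]:
    """Lift per-frame camera-space wrist positions into one world frame."""
    if len(wrists_cam) != len(cam_poses):
        raise ValueError("need exactly one camera pose per frame")
    return [camera_to_world(p, R, t) for p, (R, t) in zip(wrists_cam, cam_poses)]


# Toy input (an assumption): the hand stays fixed relative to the camera while
# the camera slides 0.1 m per frame along +x, so the wrist moves in world space.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
wrists_cam = [(0.0, 0.0, 0.5)] * 3
cam_poses = [(IDENTITY, (0.1 * k, 0.0, 0.0)) for k in range(3)]
trajectory = wrist_trajectory_world(wrists_cam, cam_poses)
```

The toy case shows why camera tracking matters: without the per-frame camera pose, a hand carried along by a moving camera would appear motionless.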

Atomic-Level Action Segmentation: Natural Segmentation Based on Velocity Minima

Existing robot V-L-A data typically consists of simple, short-horizon atomic-level tasks. Accurately segmenting these atomic actions from long videos is a difficult problem.

Inspired by the natural rhythm of human movements, the research team proposed a simple yet efficient segmentation algorithm: segmenting based on the minima of hand movement speed in 3D space.

During action transitions, human hands typically exhibit changes in speed, and speed minima often mark the switching of actions.

By detecting the speed minima of the 3D wrist trajectory in world space, this method can efficiently segment long videos into short clips containing a single atomic action, without requiring any additional manual annotation or model inference.
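A rough sketch of the idea, assuming a plain local-minimum test on finite-difference wrist speeds. The smoothing, thresholds, and minimum clip length a real pipeline would need are not described in the article and are left out; `min_gap` and the speed profile are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def wrist_speeds(positions: Sequence[Vec3], fps: float) -> List[float]:
    """Finite-difference speed (units per second) between consecutive frames."""
    return [math.dist(a, b) * fps for a, b in zip(positions, positions[1:])]


def speed_minima(speeds: Sequence[float], min_gap: int = 1) -> List[int]:
    """Indices of local speed minima, at least `min_gap` samples apart."""
    cuts: List[int] = []
    for i in range(1, len(speeds) - 1):
        is_minimum = speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]
        if is_minimum and (not cuts or i - cuts[-1] >= min_gap):
            cuts.append(i)
    return cuts


def split_at(num_samples: int, cuts: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn cut indices into half-open (start, end) segments."""
    bounds = [0, *cuts, num_samples]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


# Made-up speed profile with two clear dips, at indices 4 and 8.
speeds = [0.1, 0.5, 0.9, 0.4, 0.05, 0.3, 0.8, 0.6, 0.1, 0.4]
cuts = speed_minima(speeds)
segments = split_at(len(speeds), cuts)
```

On this profile the two dips become cut points, giving three candidate atomic clips. Because the procedure only inspects the trajectory, it needs no extra model inference, which matches the article's point about efficiency.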

Instruction Annotation: Precise Action Description Combining 3D Trajectories

To generate accurate language instructions for the segmented video clips, the research team cleverly combined a Vision-Language Model (VLM) with 3D hand trajectories.

For each video clip, the system uniformly samples 8 frames and projects and overlays the palm's 3D trajectory onto these images.

Then, these images with trajectory highlighting are input into GPT-4, prompting it to describe the specified hand's action in an imperative sentence form, combining image content and trajectory information.

Experiments proved that providing atomic-level video clips overlaid with 3D hand trajectories significantly improves the accuracy of GPT-generated action descriptions.
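The frame sampling and trajectory overlay can be sketched as follows, assuming a simple pinhole camera without lens distortion. Whether the paper's sampling includes the first and last frames is not stated in the article, so the endpoint handling and all function names here are assumptions.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> List[int]:
    """Evenly spaced frame indices that include the first and last frame."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(k * step) for k in range(num_samples)]


def project_to_pixel(
    p_cam: Vec3, fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-space point; None if behind the camera."""
    x, y, z = p_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# A 29-frame clip sampled at 8 frames, and one palm point projected with
# made-up intrinsics for a 640x480 image.
frame_ids = uniform_frame_indices(num_frames=29)
pixel = project_to_pixel((0.1, -0.05, 0.5), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Projecting every point of the palm trajectory this way yields a 2D polyline that can be drawn onto each sampled frame before the images are passed to the VLM. Clips shorter than the sample count would produce repeated indices in this sketch.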

Achieving Powerful Zero-Shot Prediction and Real-World Generalization

Based on the automatically constructed large-scale human hand V-L-A dataset, the research team designed and trained a VLA model specifically tailored for dexterous manipulation.

1. Model Architecture: Combining VLM and Diffusion Action Experts

This VLA model consists of a VLM backbone network (PaliGemma-2) and a diffusion action expert (Diffusion Transformer, DiT).

The VLM receives visual observations, language instructions, and camera field-of-view (FoV) information, outputting a "Cognition Feature".
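For context on why FoV is a useful conditioning signal: under a pinhole model with the principal point at the image center, field of view and focal length in pixels determine each other. The snippet below is a sketch of that standard relation only; how VITRA actually encodes FoV for the VLM is not described in the article.

```python
import math


def focal_from_fov(fov_deg: float, size_px: float) -> float:
    """Focal length in pixels for a given field of view along one image axis."""
    return (size_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


def fov_from_focal(focal_px: float, size_px: float) -> float:
    """Inverse relation: field of view in degrees from focal length in pixels."""
    return math.degrees(2.0 * math.atan((size_px / 2.0) / focal_px))


# A 90-degree horizontal FoV on a 640-pixel-wide image gives f of about 320 px.
fx = focal_from_fov(90.0, 640.0)
```

The same hand motion spans very different pixel distances under a wide-angle and a narrow lens, so telling the model the FoV plausibly helps it relate image content to metric 3D motion across videos from varied cameras.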

The diffusion action expert receives this cognition feature, the current hand state, and a noisy action block with masking, iteratively denoising to predict future hand action sequences.

To handle fast-moving human hand actions and adapt to short-clip data, the model employs a Causal Attention mechanism for action denoising, ensuring each action step's prediction depends only on previous actions, effectively mitigating negative impacts from zero-padding.
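The effect of causal masking on padded steps can be shown with a small sketch, assuming the zero-padding sits at the end of the action chunk. This is a generic masked softmax in plain Python, not the model's attention code, and the score values are arbitrary.

```python
import math
from typing import List, Sequence


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when step i may attend to step j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(
    scores: Sequence[Sequence[float]], mask: Sequence[Sequence[bool]]
) -> List[List[float]]:
    """Row-wise softmax over the allowed positions only."""
    weights = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# A 4-step action chunk whose last two steps are padding. The two score
# matrices differ only in columns 2 and 3, which belong to the padded steps.
mask = causal_mask(4)
scores_a = [[0.2, 1.0, 9.0, 9.0], [0.1, 0.4, 9.0, 9.0],
            [0.0, 0.0, 0.0, 9.0], [0.0, 0.0, 0.0, 0.0]]
scores_b = [[0.2, 1.0, -3.0, -3.0], [0.1, 0.4, -3.0, -3.0],
            [0.0, 0.0, 5.0, -3.0], [0.0, 0.0, 5.0, 5.0]]
weights_a = masked_softmax(scores_a, mask)
weights_b = masked_softmax(scores_b, mask)
```

The attention weights of the two valid steps are identical in both cases, because a causal mask never lets an earlier step look at the trailing padded positions. With bidirectional attention, the padded columns would leak into every row.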

2. Zero-Shot Hand Action Prediction: Demonstrating Remarkable Ability in Unseen Environments

In completely unseen real-life environments, the pre-trained model demonstrated powerful zero-shot hand action prediction capabilities.

In evaluations for grasping tasks and general action prediction tasks, this model significantly outperformed models trained on data collected in lab environments (like EgoDex), and also outperformed models trained using raw human-annotated data.

This fully demonstrates that pre-training with massive, diverse real-life videos can greatly enhance the model's generalization ability for complex environments and unknown objects.

3. Real-Robot Dexterous Manipulation: Efficient Deployment with Minimal Fine-Tuning Data

To deploy on real robots, the research team aligned the human hand's action space with that of the robot dexterous hand (e.g., Realman robot equipped with StarXHand1).

After fine-tuning the pre-trained model on only a small amount of real robot teleoperation data (about 1.2K demonstrations), it could execute various dexterous manipulation tasks in the real world, including grasping, placing, pouring, and sweeping.

Experimental results show that compared to models not pre-trained on human VLA data or pre-trained on other datasets (like OXE, EgoDex), this method achieved significant improvement in task success rate, especially demonstrating remarkable robustness when facing unseen objects and backgrounds.

The Hardware Core Supporting VITRA's Real-World Deployment

The VITRA framework's strong generalization on real robots relies not only on algorithmic innovations but also on the underlying hardware: StarXHand1, the domestically pioneered, fully direct-drive five-fingered dexterous hand developed by StarX Robotics.

This framework forms a perfect "software-hardware synergy" with the hardware characteristics of StarXHand1, demonstrating irreplaceable deployment advantages in practical application scenarios.

High-Precision URDF and Seamless Connection to Human Hand Action Space

The core breakthrough of the VITRA framework lies in aligning the human hand action space with the robot dexterous hand's action space.

StarXHand1 officially provides an extremely high-precision URDF model, which not only accurately describes motion and dynamics parameters but also perfectly maps the spatial distribution of human hand joints.

This "digital twin"-level model support enables VITRA to precisely map human joint angles to the corresponding joints of StarXHand1 during the fine-tuning phase, thereby significantly reducing the reality gap from human videos to real hardware and ensuring the efficient deployment of pre-trained strategies on real hardware.
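As a purely illustrative sketch of joint-level mapping, the snippet below copies human joint angles to robot joints through a name table and clamps them to joint limits. The joint names and limit values are hypothetical placeholders; real ones would be read from the hand's URDF, and the article does not specify the actual alignment procedure.

```python
from typing import Dict, Tuple

# Hypothetical names and limits (radians). Real values would come from the
# robot hand's URDF; none of these are taken from the article or the vendor.
JOINT_MAP: Dict[str, str] = {
    "human_index_mcp": "robot_index_proximal",
    "human_index_pip": "robot_index_distal",
    "human_thumb_mcp": "robot_thumb_proximal",
}
JOINT_LIMITS: Dict[str, Tuple[float, float]] = {
    "robot_index_proximal": (0.0, 1.9),
    "robot_index_distal": (0.0, 1.9),
    "robot_thumb_proximal": (0.0, 1.0),
}


def retarget(human_angles: Dict[str, float]) -> Dict[str, float]:
    """Copy each human joint angle to its robot joint, clamped to the limits."""
    commands: Dict[str, float] = {}
    for human_joint, robot_joint in JOINT_MAP.items():
        lower, upper = JOINT_LIMITS[robot_joint]
        angle = human_angles.get(human_joint, 0.0)  # missing joints default to 0
        commands[robot_joint] = min(max(angle, lower), upper)
    return commands


commands = retarget(
    {"human_index_mcp": 0.8, "human_index_pip": 2.4, "human_thumb_mcp": -0.2}
)
```

The more closely the robot hand's kinematics follow the human hand's, the less such a table has to distort the predicted motion, which is the role the article attributes to an accurate URDF.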

Fully Direct-Drive Architecture and High-Frequency Response: Perfect Execution of Complex Dexterous Operations

When performing complex dexterous operations such as pouring and sweeping, robots require extremely high dynamic response capability.

The fully direct-drive motor architecture adopted by StarXHand1 provides the most ideal hardware foundation for this algorithm.

The fully direct-drive design fundamentally eliminates the significant friction, hysteresis, and nonlinear interference brought by traditional reducers, endowing the dexterous hand with super-sensitive dynamic response capabilities. This enables StarXHand1 to instantly and precisely execute the action commands output by the VITRA model, safely manipulating various unknown objects.

Rich Sensor Array: Reserving Space for Future Multimodal Perception

Although the current VITRA model primarily relies on visual input, the rich sensor array equipped on StarXHand1 (such as high-resolution tactile arrays) reserves vast space for future multimodal perception.

Combined with StarXHand1's powerful hardware perception capabilities, future VLA models are expected to further integrate tactile feedback and handle more delicate and complex finger-gaiting tasks.

The Scaling Law of Data Size

This research also explored the impact of pre-training data scale on model performance.

Experiments found that as the amount of pre-training data increased, the model's error in zero-shot hand action prediction tasks steadily decreased, and its success rate in real robot manipulation tasks continuously rose.

This clear scaling behavior indicates that by further expanding the scale of human video data, the performance of VLA models is expected to continuously improve.
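One common way to quantify such a trend is to fit a power law in log-log space. The article does not say which functional form, if any, the paper fits, so this is only a sketch of the technique. The data points below are synthetic placeholders generated from a known curve, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(
    sizes: Sequence[float], errors: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of error = a * size**b in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic placeholder points drawn from error = 2.0 * N**-0.25 (NOT paper data).
sizes = [1e4, 1e5, 1e6]
errors = [2.0 * n ** -0.25 for n in sizes]
a, b = fit_power_law(sizes, errors)
```

A negative exponent `b` means error keeps falling as data grows. Its magnitude indicates how much additional video would be needed for a given improvement, which is the practical question behind any scaling claim.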

This achievement marks a key breakthrough in utilizing unstructured human videos for robotic VLA model pre-training.

By providing a fully automated data conversion solution, this research significantly lowers the barrier to acquiring high-quality robot training data, paving the way for the application of multi-fingered dexterous hands in broader, more complex real-world scenarios, and laying a solid foundation for moving towards truly generalized embodied intelligence.

Paper link: https://arxiv.org/abs/2510.21571

This article is from WeChat Official Account "QbitAI", author: VITRA Team

Related Questions

Q: What is the main contribution of the VITRA framework introduced in the article?

A: The main contribution of VITRA is a fully automated framework that converts massive amounts of unlabeled real-world human activity videos into data aligned with robot VLA training data formats. It creates a large-scale hand V-L-A dataset (1M clips, 26M frames) for pre-training, enabling models to achieve strong zero-shot prediction and, after minimal robot data fine-tuning, high success in real robot dexterous manipulation.

Q: What are the three core technical components for converting human videos into V-L-A data in the VITRA framework?

A: The three core technical components are: 1) 3D motion annotation for recovering precise hand and camera trajectories from monocular videos, 2) atomic-level action segmentation based on velocity minima in 3D hand trajectories, and 3) instruction annotation using a VLM (like GPT-4) prompted with images and overlaid 3D hand trajectories to generate accurate action descriptions.

Q: How does the pre-trained VLA model achieve zero-shot hand action prediction in unseen environments?

A: The VLA model, pre-trained on the large-scale human video dataset, shows strong zero-shot hand action prediction capabilities in completely unseen real-world environments. Its architecture combines a VLM backbone, which processes visual input and instructions to output a cognition feature, with a diffusion action expert that predicts future hand action sequences through iterative denoising. It outperforms models trained on lab data or human-annotated datasets.

Q: How is the VITRA framework deployed on a real robot for dexterous manipulation tasks?

A: To deploy on a real robot, the VITRA framework aligns the human hand action space with the robot hand's action space. After pre-training on human videos, the model is fine-tuned using a small amount of real robot teleoperation data (e.g., ~1.2K demos). This fine-tuned model can then execute various dexterous tasks like grasping, placing, pouring, and sweeping on the real robot with high success rates and strong generalization to new objects and backgrounds.

Q: What role does the Xingdong XHAND1 dexterous hand hardware play in the successful deployment of VITRA?

A: The Xingdong XHAND1 dexterous hand provides crucial hardware support for VITRA's deployment. Its high-precision URDF model enables alignment between human and robot action spaces. Its fully direct-drive architecture offers high-frequency dynamic response for executing complex VITRA commands. Its rich sensor array also leaves room for future multimodal perception integration, forming a 'software-hardware synergy' for real-world application.
