A First: A VLA for Dexterous Manipulation Pretrained Purely on Human Video, Deployable with Minimal Fine-Tuning Data

marsbit. Published 2026-06-08; updated 2026-06-08

Summary

For the first time, a Vision-Language-Action (VLA) model for dexterous manipulation that is pretrained purely on human video needs only a small amount of fine-tuning data to be deployed successfully in the real world. Achieving human-level dexterous manipulation remains a core challenge in robotics. While multi-fingered hands offer hardware potential, VLA models lag behind due to the high cost of collecting diverse, high-quality robot data. A novel framework, VITRA, developed by Microsoft Research Asia and Tsinghua University, addresses this by automatically transforming massive, unlabeled real-world human activity videos into a structured V-L-A training dataset. Key innovations include precise 3D hand motion annotation from monocular video, atomic action segmentation based on hand-speed minima, and automated instruction generation using VLMs combined with 3D trajectory visualization. This process created a massive dataset of 1 million clips. Pretrained exclusively on this human video data, the VLA model (combining a VLM backbone with a Diffusion Transformer action expert) demonstrates strong zero-shot hand motion prediction in unseen environments. Crucially, it requires minimal fine-tuning (~1.2k demonstrations) on real robot data to achieve high-success-rate dexterous manipulation tasks like grasping, placing, pouring, and sweeping on hardware like the Realman robot with the XHAND1 dexterous hand. The model shows exceptional generalization to novel obje...

Achieving human-level dexterous manipulation has long been a core challenge in robotics.

Multi-fingered dexterous hands have hardware potential comparable to the human hand, but high-quality robot action data is expensive to acquire. As a result, existing Vision-Language-Action (VLA) models lag far behind large language models (LLMs) and vision-language models (VLMs) in data scale and diversity, and struggle to meet the demands of complex real-world tasks.

A recent research paper titled "Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos" from Microsoft Research Asia (MSRA) in collaboration with Tsinghua University addresses this critical issue by proposing an innovative pre-training framework called VITRA.

The core contribution of this research is a fully automated pipeline that converts massive amounts of unlabeled real-world human activity videos into data aligned with the V-L-A training format already used for robots.

By extracting 3D hand motion trajectories from videos, performing atomic-level action segmentation, and automatically generating language instructions, the research team constructed a large-scale hand V-L-A dataset containing 1 million clips and 26 million frames.

After pre-training solely on human video data, the model demonstrated powerful zero-shot hand action prediction capabilities in completely unseen real-world environments.

With only a small amount of fine-tuning using real robot data, it achieved high success rates in dexterous manipulation on real robots and exhibited strong generalization ability to new objects and environments.

More details follow below.

Bridging the Gap from Human Videos to Robot Data

The central challenge of the paper is how to overcome the vast difference between unstructured human videos and structured robot data, thereby extracting high-quality action labels and language instructions usable for VLA model pre-training.

This research built a complete system consisting of three core technologies, achieving seamless transformation from raw video to V-L-A data.

3D Motion Annotation: Accurately Recovering Hand and Camera Trajectories

Recovering precise 3D hand motion from monocular, uncalibrated, and potentially moving camera videos is an extremely challenging task.

This research proposes a monocular camera and hand pose tracking method based on the latest 3D vision technology:

First, it determines the camera state via background optical flow and estimates camera intrinsics.

Subsequently, it tracks camera pose using a visual SLAM method and a depth estimation model, and extracts the camera-space 3D hand pose per frame (including wrist 6D pose and full joint angles) using a hand reconstruction model.

Finally, by combining this information, it obtains the 3D hand motion trajectory in world space.
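The final combination step can be pictured as composing rigid transforms. The following is a minimal sketch, assuming the camera poses and the camera-space wrist poses are already available as 4x4 homogeneous matrices; the function names `make_pose` and `wrist_to_world` are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose

def wrist_to_world(world_from_cam, cam_from_wrist):
    """Compose per-frame poses: world <- camera <- wrist.

    world_from_cam: (T, 4, 4) camera poses (assumed output of a SLAM stage).
    cam_from_wrist: (T, 4, 4) wrist poses in camera space (assumed output of
                    a hand reconstruction model).
    Returns (T, 4, 4) wrist poses in world space.
    """
    return np.einsum("tij,tjk->tik", world_from_cam, cam_from_wrist)

# Toy data: the camera moves 1 m along x while the wrist stays 0.5 m in front of it.
world_from_cam = np.stack([make_pose(np.eye(3), [0.0, 0.0, 0.0]),
                           make_pose(np.eye(3), [1.0, 0.0, 0.0])])
cam_from_wrist = np.stack([make_pose(np.eye(3), [0.0, 0.0, 0.5])] * 2)
world_from_wrist = wrist_to_world(world_from_cam, cam_from_wrist)
print(world_from_wrist[:, :3, 3])  # world-space wrist positions per frame
```

In the toy data the wrist appears stationary in camera space, yet its world-space position shifts with the camera, which is why camera tracking is needed before any world-space trajectory analysis.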

This method not only provides high-precision action labels but also lays the foundation for subsequent action segmentation and instruction annotation.

Atomic-Level Action Segmentation: Natural Segmentation Based on Velocity Minima

Existing robot V-L-A data typically consists of simple, short-horizon atomic-level tasks. Accurately segmenting these atomic actions from long videos is a difficult problem.

Inspired by the natural rhythm of human movements, the research team proposed a simple yet efficient segmentation algorithm: segmenting based on the minima of hand movement speed in 3D space.

During action transitions, human hands typically exhibit changes in speed, and speed minima often mark the switching of actions.

By detecting the speed minima of the 3D wrist trajectory in world space, this method can efficiently segment long videos into short clips containing a single atomic action, without requiring any additional manual annotation or model inference.
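The idea can be sketched in a few lines. This is an illustrative version only, assuming a clean world-space wrist trajectory and a simple neighbor comparison for detecting minima; the function name `segment_by_speed_minima` is hypothetical, and a real pipeline would probably smooth the speed signal first, which is not shown here.

```python
import numpy as np

def segment_by_speed_minima(wrist_xyz, fps):
    """Split a trajectory at local minima of wrist speed.

    wrist_xyz: (T, 3) world-space wrist positions.
    fps: frame rate, used to convert per-frame displacement into speed.
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    # speed[i] is the speed between frame i and frame i + 1.
    speed = np.linalg.norm(np.diff(wrist_xyz, axis=0), axis=1) * fps
    i = np.arange(1, len(speed) - 1)
    # Assumed minimum test: not above the left neighbor, strictly below the right one.
    is_min = (speed[i] <= speed[i - 1]) & (speed[i] < speed[i + 1])
    cuts = (i[is_min] + 1).tolist()  # cut at the frame that ends the slowest step
    bounds = [0] + cuts + [len(wrist_xyz)]
    return list(zip(bounds[:-1], bounds[1:]))

# Toy 1D motion: speed up, slow almost to a stop, then speed up again.
step_speeds = np.array([1.0, 2.0, 1.0, 0.2, 1.0, 2.0, 1.0])
x = np.concatenate([[0.0], np.cumsum(step_speeds)])
wrist_xyz = np.stack([x, np.zeros_like(x), np.zeros_like(x)], axis=1)
segments = segment_by_speed_minima(wrist_xyz, fps=1.0)
print(segments)  # [(0, 4), (4, 8)]
```

The single slow step in the middle of the toy trajectory produces one cut, giving two clips that each contain one accelerate-and-decelerate movement.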

Instruction Annotation: Precise Action Description Combining 3D Trajectories

To generate accurate language instructions for the segmented video clips, the research team cleverly combined a Vision-Language Model (VLM) with 3D hand trajectories.

For each video clip, the system uniformly samples 8 frames and projects and overlays the palm's 3D trajectory onto these images.

These trajectory-annotated images are then passed to GPT-4, which is prompted to describe the specified hand's action as an imperative sentence, using both the image content and the trajectory information.

Experiments proved that providing atomic-level video clips overlaid with 3D hand trajectories significantly improves the accuracy of GPT-generated action descriptions.
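The frame sampling and trajectory overlay can be illustrated as follows. This is a sketch under stated assumptions: a pinhole camera with known intrinsics `fx`, `fy`, `cx`, `cy`, and palm points already expressed in the camera frame of the image they are drawn on. The helper names are hypothetical, and the drawing and prompting steps are omitted.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=8):
    """Uniformly spaced frame indices, including the first and last frame."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def project_points(points_cam, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

# A 50-frame clip sampled at 8 frames.
indices = sample_frame_indices(50)
print(indices.tolist())  # [0, 7, 14, 21, 28, 35, 42, 49]

# Two palm positions in front of a 640x480 camera (intrinsics are placeholder values).
palm_cam = np.array([[0.0, 0.0, 1.0],
                     [0.1, 0.0, 0.5]])
pixels = project_points(palm_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(pixels)  # pixel locations where the trajectory would be drawn
```

Because the trajectory lives in world space, each sampled frame would first need the world-to-camera transform for that frame before projection.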

Achieving Powerful Zero-Shot Prediction and Real-World Generalization

Based on the automatically constructed large-scale human hand V-L-A dataset, the research team designed and trained a VLA model specifically tailored for dexterous manipulation.

1. Model Architecture: Combining VLM and Diffusion Action Experts

This VLA model consists of a VLM backbone network (PaliGemma-2) and a diffusion action expert (Diffusion Transformer, DiT).

The VLM receives visual observations, language instructions, and camera field-of-view (FoV) information, outputting a "Cognition Feature".
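The article does not say how the FoV is computed or encoded for the VLM. As background only, the standard pinhole relation between focal length and horizontal field of view is sketched below; the function names are hypothetical.

```python
import math

def horizontal_fov_deg(focal_px, image_width_px):
    """Horizontal field of view in degrees for a pinhole camera."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_px)))

def focal_from_fov(fov_deg, image_width_px):
    """Inverse relation: focal length in pixels from a horizontal FoV."""
    return image_width_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))

fov = horizontal_fov_deg(focal_px=320.0, image_width_px=640)
print(fov)                       # approximately 90 degrees
print(focal_from_fov(fov, 640))  # recovers approximately 320 pixels
```

Supplying the FoV gives the model a cue about metric scale, since the same hand motion covers a different number of pixels under different lenses.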

The diffusion action expert receives this cognition feature, the current hand state, and a noisy action block with masking, iteratively denoising to predict future hand action sequences.

To handle fast-moving human hand actions and adapt to short-clip data, the model employs a Causal Attention mechanism for action denoising, ensuring each action step's prediction depends only on previous actions, effectively mitigating negative impacts from zero-padding.
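Why causal attention limits the influence of padding can be checked with a toy example. The sketch below uses a single attention head with no learned weights, which is far simpler than the actual DiT action expert; it only demonstrates that, under a causal mask, the outputs for the real action steps do not change when the trailing padded steps change.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask and no learned weights.

    x: (L, D) sequence of action tokens. Row i of the output depends only on
    rows 0..i of x.
    """
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # key j visible iff j <= i
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
valid = rng.normal(size=(3, 4))  # 3 real action steps
padded_zero = np.vstack([valid, np.zeros((2, 4))])           # zero-padded to length 5
padded_other = np.vstack([valid, rng.normal(size=(2, 4))])   # arbitrary padding content
out_zero = causal_self_attention(padded_zero)
out_other = causal_self_attention(padded_other)
print(np.allclose(out_zero[:3], out_other[:3]))  # True
```

With bidirectional attention the first three rows would differ between the two runs, because every step could attend to the padded positions.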

2. Zero-Shot Hand Action Prediction: Demonstrating Remarkable Ability in Unseen Environments

In completely unseen real-life environments, the pre-trained model demonstrated powerful zero-shot hand action prediction capabilities.

In evaluations for grasping tasks and general action prediction tasks, this model significantly outperformed models trained on data collected in lab environments (like EgoDex), and also outperformed models trained using raw human-annotated data.

This fully demonstrates that pre-training with massive, diverse real-life videos can greatly enhance the model's generalization ability for complex environments and unknown objects.

3. Real-Robot Dexterous Manipulation: Efficient Deployment with Minimal Fine-Tuning Data

To deploy on real robots, the research team aligned the human hand's action space with that of the robot dexterous hand (e.g., Realman robot equipped with StarXHand1).

After fine-tuning the pre-trained model on only a small amount of real-robot teleoperation data (about 1.2K demonstrations), it could execute various dexterous manipulation tasks in the real world, including grasping, placing, pouring, and sweeping.

Experimental results show that compared to models not pre-trained on human VLA data or pre-trained on other datasets (like OXE, EgoDex), this method achieved significant improvement in task success rate, especially demonstrating remarkable robustness when facing unseen objects and backgrounds.

The Hardware Core Supporting VITRA's Real-World Deployment

The VITRA framework's strong generalization on real robots relies not only on algorithmic innovations but also on the underlying hardware: the domestically pioneered, fully direct-drive five-fingered dexterous hand StarXHand1, developed by StarX Robotics.

This framework forms a perfect "software-hardware synergy" with the hardware characteristics of StarXHand1, demonstrating irreplaceable deployment advantages in practical application scenarios.

High-Precision URDF and Seamless Connection to Human Hand Action Space

The core breakthrough of the VITRA framework lies in aligning the human hand action space with the robot dexterous hand's action space.

StarXHand1 officially provides an extremely high-precision URDF model, which not only accurately describes motion and dynamics parameters but also perfectly maps the spatial distribution of human hand joints.

This "digital twin"-level model support enables VITRA to precisely map human joint angles to the corresponding joints of StarXHand1 during the fine-tuning phase, thereby significantly reducing the reality gap from human videos to real hardware and ensuring the efficient deployment of pre-trained strategies on real hardware.
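The article does not describe the actual mapping procedure. A minimal sketch of one plausible form is given below, assuming a fixed index correspondence between human and robot joints and clipping to the robot's joint limits; the function name, the mapping, and the limit values are all placeholders, and real limits would be read from the hand's URDF.

```python
import numpy as np

def retarget_joint_angles(human_angles, joint_map, lower, upper):
    """Map human joint angles to robot joint targets.

    human_angles: (H,) human hand joint angles in radians.
    joint_map: (R,) index of the human joint that drives each robot joint
               (hypothetical correspondence).
    lower, upper: (R,) robot joint limits in radians (placeholder values).
    """
    return np.clip(human_angles[joint_map], lower, upper)

human_angles = np.array([0.2, 1.8, -0.1])  # toy values, not measured data
joint_map = np.array([0, 1, 2])
lower = np.zeros(3)
upper = np.full(3, 1.5)
targets = retarget_joint_angles(human_angles, joint_map, lower, upper)
print(targets)  # angles outside the limits are clamped to the nearest limit
```

The closer the robot hand's kinematics are to the human hand, the less such a mapping has to distort the pretrained action space, which is the point the passage above makes about the URDF.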

Fully Direct-Drive Architecture and High-Frequency Response: Perfect Execution of Complex Dexterous Operations

When performing complex dexterous operations such as pouring and sweeping, robots require extremely high dynamic response capability.

The fully direct-drive motor architecture adopted by StarXHand1 provides the most ideal hardware foundation for this algorithm.

The fully direct-drive design fundamentally eliminates the significant friction, hysteresis, and nonlinear interference brought by traditional reducers, endowing the dexterous hand with super-sensitive dynamic response capabilities. This enables StarXHand1 to instantly and precisely execute the action commands output by the VITRA model, safely manipulating various unknown objects.

Rich Sensor Array: Reserving Space for Future Multimodal Perception

Although the current VITRA model primarily relies on visual input, the rich sensor array equipped on StarXHand1 (such as high-resolution tactile arrays) reserves vast space for future multimodal perception.

Combined with StarXHand1's hardware perception capabilities, future VLA models are expected to integrate tactile feedback and handle more delicate and complex finger-gaiting tasks.

The Scaling Law of Data Size

This research also explored the impact of pre-training data scale on model performance.

Experiments found that as the amount of pre-training data increased, the model's error in zero-shot hand action prediction tasks steadily decreased, and its success rate in real robot manipulation tasks continuously rose.

This clear scaling behavior indicates that by further expanding the scale of human video data, the performance of VLA models is expected to continuously improve.

This achievement marks a key breakthrough in utilizing unstructured human videos for robotic VLA model pre-training.

By providing a fully automated data conversion solution, this research significantly lowers the barrier to acquiring high-quality robot training data, paving the way for the application of multi-fingered dexterous hands in broader, more complex real-world scenarios, and laying a solid foundation for moving towards truly generalized embodied intelligence.

Paper link: https://arxiv.org/abs/2510.21571

This article is from WeChat Official Account "QbitAI", author: VITRA Team

Related Questions

Q: What is the main contribution of the VITRA framework introduced in the article?

A: The main contribution of VITRA is a fully automated framework that converts massive amounts of unlabeled real-world human activity videos into data aligned with robot VLA training data formats. It creates a large-scale hand V-L-A dataset (1M clips, 26M frames) for pre-training, enabling models to achieve strong zero-shot prediction and, after minimal robot data fine-tuning, high success in real robot dexterous manipulation.

Q: What are the three core technical components for converting human videos into V-L-A data in the VITRA framework?

A: The three core technical components are: 1) 3D motion annotation for recovering precise hand and camera trajectories from monocular videos, 2) atomic-level action segmentation based on velocity minima in 3D hand trajectories, and 3) instruction annotation using a VLM (like GPT-4) prompted with images and overlaid 3D hand trajectories to generate accurate action descriptions.

Q: How does the pre-trained VLA model achieve zero-shot hand action prediction in unseen environments?

A: The VLA model, pre-trained on the large-scale human video dataset, shows strong zero-shot hand action prediction capabilities in completely unseen real-world environments. Its architecture combines a VLM backbone, which processes visual input and instructions and outputs a cognition feature, with a diffusion action expert that predicts future hand action sequences through iterative denoising. It outperforms models trained on lab data or human-annotated datasets.

Q: How is the VITRA framework deployed on a real robot for dexterous manipulation tasks?

A: To deploy on a real robot, the VITRA framework aligns the human hand action space with the robot hand's action space. After pre-training on human videos, the model is fine-tuned using a small amount of real robot teleoperation data (e.g., ~1.2K demos). This fine-tuned model can then execute various dexterous tasks like grasping, placing, pouring, and sweeping on the real robot with high success rates and strong generalization to new objects and backgrounds.

Q: What role does the Xingdong XHAND1 dexterous hand hardware play in the successful deployment of VITRA?

A: The Xingdong XHAND1 dexterous hand provides crucial hardware support for VITRA's deployment. Its high-precision URDF model enables alignment between human and robot action spaces. Its fully direct-drive architecture offers high-frequency dynamic response for executing complex VITRA commands. Its rich sensor array also leaves room for future multimodal perception integration, forming a "software-hardware synergy" for real-world application.
