First Dexterous-Manipulation VLA Pretrained Purely on Human Video: Deployable with Minimal Fine-Tuning Data

marsbit · Published 2026-06-08 · Last updated 2026-06-08

Summary

For the first time, a Vision-Language-Action (VLA) model for dexterous manipulation that is pretrained purely on human video needs only a small amount of fine-tuning data to be deployed successfully in the real world. Achieving human-level dexterous manipulation remains a core challenge in robotics. While multi-fingered hands offer the hardware potential, VLA models lag behind because diverse, high-quality robot data is expensive to collect. A novel framework, VITRA, developed by Microsoft Research Asia and Tsinghua University, addresses this by automatically transforming massive, unlabeled real-world human activity videos into a structured V-L-A training dataset. Key innovations include precise 3D hand motion annotation from monocular video, atomic action segmentation based on hand-speed minima, and automated instruction generation using VLMs combined with 3D trajectory visualization. This process produced a dataset of 1 million clips. Pretrained exclusively on this human video data, the VLA model (combining a VLM backbone with a Diffusion Transformer action expert) demonstrates strong zero-shot hand motion prediction in unseen environments. Crucially, it requires only minimal fine-tuning (~1.2k demonstrations) on real robot data to perform dexterous manipulation tasks such as grasping, placing, pouring, and sweeping with high success rates on hardware like the Realman robot with the XHAND1 dexterous hand. The model shows exceptional generalization to novel obje...

Achieving human-level dexterous manipulation capability has long been a core challenge in the field of robotics.

Although multi-fingered dexterous hands have hardware potential comparable to that of human hands, high-quality robot action data is expensive to acquire. As a result, existing Vision-Language-Action (VLA) models lag far behind large language models (LLMs) and vision-language models (VLMs) in data scale and diversity, and struggle to meet the demands of complex real-world tasks.

A recent research paper titled "Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos" from Microsoft Research Asia (MSRA) in collaboration with Tsinghua University addresses this critical issue by proposing an innovative pre-training framework called VITRA.

The core contribution of this research is a fully automated pipeline that converts massive amounts of unlabeled real-world human activity videos into data that matches the V-L-A format of existing robot training data.

By extracting 3D hand motion trajectories from videos, performing atomic-level action segmentation, and automatically generating language instructions, the research team constructed a large-scale hand V-L-A dataset containing 1 million clips and 26 million frames.

After pre-training solely on human video data, the model demonstrated powerful zero-shot hand action prediction capabilities in completely unseen real-world environments.

With only a small amount of fine-tuning using real robot data, it achieved high success rates in dexterous manipulation on real robots and exhibited strong generalization ability to new objects and environments.

More details follow below.

Bridging the Gap from Human Videos to Robot Data

The central challenge of the paper is how to overcome the vast difference between unstructured human videos and structured robot data, thereby extracting high-quality action labels and language instructions usable for VLA model pre-training.

The researchers built a complete system of three core components that converts raw video into V-L-A data.


3D Motion Annotation: Accurately Recovering Hand and Camera Trajectories

Recovering precise 3D hand motion from monocular, uncalibrated, and potentially moving camera videos is an extremely challenging task.

This research proposes a monocular camera and hand pose tracking method based on the latest 3D vision technology:

First, it determines the camera state via background optical flow and estimates camera intrinsics.

Subsequently, it tracks camera pose using a visual SLAM method and a depth estimation model, and extracts the camera-space 3D hand pose per frame (including wrist 6D pose and full joint angles) using a hand reconstruction model.

Finally, by combining this information, it obtains the 3D hand motion trajectory in world space.
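The final step above, combining the per-frame camera pose with the camera-space hand pose, can be sketched as a composition of rigid transforms. This is a minimal illustration under assumed conventions, not the authors' implementation: poses are 4x4 homogeneous matrices, and the names `make_pose` and `hand_to_world` are hypothetical.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = np.asarray(rotation, dtype=float)
    pose[:3, 3] = np.asarray(translation, dtype=float)
    return pose


def hand_to_world(T_world_cam, T_cam_hand):
    """Compose world<-camera with camera<-hand to get world<-hand.

    Both inputs may be a single 4x4 matrix or a stack of shape (N, 4, 4),
    one pose per frame; matrix multiplication broadcasts over the stack.
    """
    return np.matmul(T_world_cam, T_cam_hand)


if __name__ == "__main__":
    # Assumed toy values: camera shifted 1 m along world x, wrist 2 m in front of it.
    T_world_cam = make_pose(np.eye(3), [1.0, 0.0, 0.0])
    T_cam_hand = make_pose(np.eye(3), [0.0, 0.0, 2.0])
    print(hand_to_world(T_world_cam, T_cam_hand)[:3, 3])
```

Applying this per frame yields a wrist trajectory in a fixed world frame, which is what the segmentation step below operates on.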

This method not only provides high-precision action labels but also lays the foundation for subsequent action segmentation and instruction annotation.

Atomic-Level Action Segmentation: Natural Segmentation Based on Velocity Minima

Existing robot V-L-A data typically consists of simple, short-horizon atomic-level tasks. Accurately segmenting these atomic actions from long videos is a difficult problem.

Inspired by the natural rhythm of human movements, the research team proposed a simple yet efficient segmentation algorithm: segmenting based on the minima of hand movement speed in 3D space.

Human hands typically slow down at the transition between two actions, so minima in hand speed often mark action boundaries.

By detecting the speed minima of the 3D wrist trajectory in world space, this method can efficiently segment long videos into short clips containing a single atomic action, without requiring any additional manual annotation or model inference.
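The idea can be sketched in a few lines of NumPy. This is an illustrative sketch only: the moving-average smoothing, the `min_len` filter, and the function name `segment_by_speed_minima` are assumptions added here, not details given in the article.

```python
import numpy as np


def segment_by_speed_minima(wrist_xyz, fps, smooth_window=5, min_len=10):
    """Split a world-space wrist trajectory into clips at local speed minima.

    wrist_xyz: array of shape (N, 3), one wrist position per frame.
    Returns a list of (start, end) frame ranges with end exclusive.
    """
    wrist_xyz = np.asarray(wrist_xyz, dtype=float)
    n = len(wrist_xyz)
    if n < 3:
        return [(0, n)]

    # speed[i] is the speed between frame i and frame i + 1.
    speed = np.linalg.norm(np.diff(wrist_xyz, axis=0), axis=1) * fps

    # Assumed smoothing step: a moving average to suppress jitter.
    k = max(1, min(smooth_window, len(speed)))
    speed = np.convolve(speed, np.ones(k) / k, mode="same")

    # Interior local minima: lower than the left neighbour, not above the right.
    idx = np.arange(1, len(speed) - 1)
    is_min = (speed[idx] < speed[idx - 1]) & (speed[idx] <= speed[idx + 1])
    cuts = idx[is_min] + 1

    # Keep a cut only if the resulting clip is at least min_len frames long.
    boundaries = [0]
    for c in cuts:
        if c - boundaries[-1] >= min_len:
            boundaries.append(int(c))
    if len(boundaries) > 1 and n - boundaries[-1] < min_len:
        boundaries.pop()  # merge a too-short tail into the previous clip
    boundaries.append(n)
    return list(zip(boundaries[:-1], boundaries[1:]))
```

On a synthetic trajectory with two speed bumps separated by a pause, this returns two clips that split at the pause. Because it only differentiates and compares a 1D signal, it needs no model inference, consistent with the claim above.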

Instruction Annotation: Precise Action Description Combining 3D Trajectories

To generate accurate language instructions for the segmented video clips, the research team cleverly combined a Vision-Language Model (VLM) with 3D hand trajectories.

For each video clip, the system uniformly samples 8 frames and projects and overlays the palm's 3D trajectory onto these images.

Then, these images with trajectory highlighting are input into GPT-4, prompting it to describe the specified hand's action in an imperative sentence form, combining image content and trajectory information.
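The frame sampling and trajectory overlay can be sketched as follows, assuming a standard pinhole camera model. The function names and intrinsics parameters (`fx`, `fy`, `cx`, `cy`) are hypothetical, and the prompt sent to the VLM is not reproduced here because the article does not give it.

```python
import numpy as np


def sample_frame_indices(num_frames, num_samples=8):
    """Pick num_samples frame indices spread uniformly over a clip."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def project_points(points_cam, fx, fy, cx, cy):
    """Project camera-space 3D points (N, 3) to pixel coordinates (N, 2).

    Assumes a pinhole model and that every point lies in front of the
    camera (z > 0); points behind the camera should be filtered beforehand.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    z = points_cam[:, 2]
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1)


if __name__ == "__main__":
    print(sample_frame_indices(15))  # indices of the 8 frames to annotate
    palm_track = [[0.0, 0.0, 1.0], [0.5, -0.25, 2.0]]  # assumed toy palm positions
    print(project_points(palm_track, fx=100.0, fy=100.0, cx=320.0, cy=240.0))
```

The projected pixel coordinates would then be drawn onto each sampled frame as a polyline before the images are passed to the VLM. Clips shorter than `num_samples` frames produce repeated indices in this sketch.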

Experiments proved that providing atomic-level video clips overlaid with 3D hand trajectories significantly improves the accuracy of GPT-generated action descriptions.

Achieving Powerful Zero-Shot Prediction and Real-World Generalization

Based on the automatically constructed large-scale human hand V-L-A dataset, the research team designed and trained a VLA model specifically tailored for dexterous manipulation.


1. Model Architecture: Combining VLM and Diffusion Action Experts

This VLA model consists of a VLM backbone network (PaliGemma-2) and a diffusion action expert (Diffusion Transformer, DiT).

The VLM receives visual observations, language instructions, and camera field-of-view (FoV) information, outputting a "Cognition Feature".

The diffusion action expert receives this cognition feature, the current hand state, and a noisy action block with masking, iteratively denoising to predict future hand action sequences.

To handle fast-moving human hand actions and adapt to short-clip data, the model employs a Causal Attention mechanism for action denoising, ensuring each action step's prediction depends only on previous actions, effectively mitigating negative impacts from zero-padding.
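To see why causal attention helps with zero-padded action chunks, consider a boolean attention mask over a chunk in which only the first `valid_len` steps are real. The sketch below is an illustration under that assumption; the function name `causal_padding_mask` is hypothetical and the paper's actual masking scheme may differ.

```python
import numpy as np


def causal_padding_mask(chunk_len, valid_len):
    """Return a (chunk_len, chunk_len) boolean mask.

    mask[i, j] is True when action step i may attend to action step j.
    Step i sees only steps j <= i (causal), and never a padded step
    (j >= valid_len).
    """
    causal = np.tril(np.ones((chunk_len, chunk_len), dtype=bool))
    valid_keys = np.arange(chunk_len) < valid_len
    return causal & valid_keys[None, :]


if __name__ == "__main__":
    # A chunk of 4 steps where a short clip fills only the first 2.
    print(causal_padding_mask(4, 2).astype(int))
```

Because padding sits at the end of a chunk, the causal constraint alone already prevents any real step from attending to padded steps: each real step only looks backward, at other real steps. Under bidirectional attention, the zero-padded tail would leak into every prediction.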

2. Zero-Shot Hand Action Prediction: Demonstrating Remarkable Ability in Unseen Environments

In completely unseen real-life environments, the pre-trained model demonstrated powerful zero-shot hand action prediction capabilities.


In evaluations for grasping tasks and general action prediction tasks, this model significantly outperformed models trained on data collected in lab environments (like EgoDex), and also outperformed models trained using raw human-annotated data.

This fully demonstrates that pre-training with massive, diverse real-life videos can greatly enhance the model's generalization ability for complex environments and unknown objects.

3. Real-Robot Dexterous Manipulation: Efficient Deployment with Minimal Fine-Tuning Data

To deploy on real robots, the research team aligned the human hand's action space with that of a robot dexterous hand (a Realman robot equipped with the XHAND1 hand).


After fine-tuning the pre-trained model on only a small amount of real-robot teleoperation data (about 1.2K demonstrations), it could execute various dexterous manipulation tasks in the real world, including grasping, placing, pouring, and sweeping.

Experimental results show that compared to models not pre-trained on human VLA data or pre-trained on other datasets (like OXE, EgoDex), this method achieved significant improvement in task success rate, especially demonstrating remarkable robustness when facing unseen objects and backgrounds.

The Hardware Core Supporting VITRA's Real-World Deployment

VITRA's generalization on real robots relies not only on algorithmic innovations but also on the underlying hardware: XHAND1, a fully direct-drive five-fingered dexterous hand developed by StarX Robotics and described as the first of its kind in China.

The framework and the hardware characteristics of XHAND1 form a "software-hardware synergy" that offers clear deployment advantages in practical application scenarios.


High-Precision URDF and Seamless Connection to Human Hand Action Space

A key step in deploying VITRA is aligning the human hand action space with the robot dexterous hand's action space.

XHAND1 ships with an official high-precision URDF model that accurately describes its kinematic and dynamic parameters and closely matches the spatial layout of human hand joints.

This "digital twin"-level model support enables VITRA to map human joint angles precisely to the corresponding joints of XHAND1 during the fine-tuning phase, which significantly narrows the gap between human videos and real hardware and allows pre-trained policies to be deployed efficiently on the robot.
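As a rough illustration of how a URDF can support this kind of joint mapping, the sketch below reads joint limits from a URDF string and clamps retargeted human joint angles to them. Everything specific here is an assumption for illustration: the toy URDF, the joint names `index_mcp` and `index_joint_0`, and the one-to-one mapping are hypothetical and do not describe the real XHAND1 model or the authors' alignment procedure. Only the `<joint>` and `<limit lower upper>` elements are standard URDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-joint URDF used only for this example.
TOY_URDF = """
<robot name="toy_hand">
  <link name="palm"/>
  <link name="index_proximal"/>
  <joint name="index_joint_0" type="revolute">
    <parent link="palm"/>
    <child link="index_proximal"/>
    <limit lower="0.0" upper="1.9" effort="1.0" velocity="1.0"/>
  </joint>
</robot>
"""


def read_joint_limits(urdf_xml):
    """Return {joint_name: (lower, upper)} in radians for joints with a <limit>."""
    root = ET.fromstring(urdf_xml)
    limits = {}
    for joint in root.iter("joint"):
        limit = joint.find("limit")
        if limit is not None:
            lower = float(limit.get("lower", 0.0))
            upper = float(limit.get("upper", 0.0))
            limits[joint.get("name")] = (lower, upper)
    return limits


def retarget(human_angles, joint_map, limits):
    """Copy each human joint angle to its mapped robot joint, clamped to limits."""
    command = {}
    for human_joint, robot_joint in joint_map.items():
        lower, upper = limits[robot_joint]
        command[robot_joint] = min(max(human_angles[human_joint], lower), upper)
    return command


if __name__ == "__main__":
    limits = read_joint_limits(TOY_URDF)
    # A human flexion of 2.5 rad exceeds the toy limit and is clamped.
    print(retarget({"index_mcp": 2.5}, {"index_mcp": "index_joint_0"}, limits))
```

A real mapping would likely need per-joint offsets or scaling, since human and robot joint axes rarely coincide exactly.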

Fully Direct-Drive Architecture and High-Frequency Response: Perfect Execution of Complex Dexterous Operations

When performing complex dexterous operations such as pouring and sweeping, robots require extremely high dynamic response capability.

The fully direct-drive motor architecture adopted by XHAND1 provides a well-suited hardware foundation for this algorithm.

The fully direct-drive design eliminates the significant friction, hysteresis, and nonlinear disturbances introduced by conventional gear reducers, giving the dexterous hand a highly sensitive dynamic response. This enables XHAND1 to execute the action commands output by the VITRA model quickly and precisely while safely manipulating a variety of unknown objects.

Rich Sensor Array: Reserving Space for Future Multimodal Perception

Although the current VITRA model primarily relies on visual input, the rich sensor array on XHAND1 (such as high-resolution tactile arrays) leaves ample room for future multimodal perception.

Combined with XHAND1's hardware perception capabilities, future VLA models are expected to integrate tactile feedback and handle more delicate and complex "finger gaiting" tasks.

The Scaling Law of Data Size

This research also explored the impact of pre-training data scale on model performance.


Experiments found that as the amount of pre-training data increased, the model's error in zero-shot hand action prediction tasks steadily decreased, and its success rate in real robot manipulation tasks continuously rose.

This clear scaling behavior indicates that by further expanding the scale of human video data, the performance of VLA models is expected to continuously improve.

This achievement marks a key breakthrough in utilizing unstructured human videos for robotic VLA model pre-training.

By providing a fully automated data conversion solution, this research significantly lowers the barrier to acquiring high-quality robot training data, paving the way for the application of multi-fingered dexterous hands in broader, more complex real-world scenarios, and laying a solid foundation for moving towards truly generalized embodied intelligence.

Paper link: https://arxiv.org/abs/2510.21571

This article is from WeChat Official Account "QbitAI", author: VITRA Team

Related Questions

Q: What is the main contribution of the VITRA framework introduced in the article?

A: The main contribution of VITRA is a fully automated framework that converts massive amounts of unlabeled real-world human activity videos into data perfectly aligned with robot VLA training data formats. It creates a large-scale hand V-L-A dataset (1M clips, 26M frames) for pre-training, enabling models to achieve strong zero-shot prediction and, after minimal robot data fine-tuning, high success in real robot dexterous manipulation.

Q: What are the three core technical components for converting human videos into V-L-A data in the VITRA framework?

A: The three core technical components are: 1) 3D motion annotation for recovering precise hand and camera trajectories from monocular videos, 2) Atomic-level action segmentation based on velocity minima in 3D hand trajectories, and 3) Instruction annotation using VLM (like GPT-4) prompted with images and overlaid 3D hand trajectory to generate accurate action descriptions.

Q: How does the pre-trained VLA model achieve zero-shot hand action prediction in unseen environments?

A: The VLA model, pre-trained on the large-scale human video dataset, shows strong zero-shot hand action prediction capabilities in completely unseen real-world environments. Its architecture combines a VLM backbone for processing visual input and instructions to output a cognition feature, and a diffusion action expert to predict future hand action sequences through iterative denoising, outperforming models trained on lab data or human-annotated datasets.

Q: How is the VITRA framework deployed on a real robot for dexterous manipulation tasks?

A: To deploy on a real robot, the VITRA framework aligns the human hand action space with the robot hand's action space. After pre-training on human videos, the model is fine-tuned using a small amount of real robot teleoperation data (e.g., ~1.2K demos). This fine-tuned model can then execute various dexterous tasks like grasping, placing, pouring, and sweeping on the real robot with high success rates and strong generalization to new objects and backgrounds.

Q: What role does the Xingdong XHAND1 dexterous hand hardware play in the successful deployment of VITRA?

A: The Xingdong XHAND1 dexterous hand provides crucial hardware support for VITRA's deployment. Its high-precision URDF model enables seamless alignment between human and robot action spaces. Its full direct-drive architecture offers high-frequency dynamic response, perfectly executing complex VITRA commands. Its rich sensor array also leaves space for future multi-modal perception integration, forming a powerful 'software-hardware synergy' for real-world application.
