NVIDIA Team Enables Programming Agent to Take Over Real Robot Experiments, Achieves 99% Success Rate

marsbit | Published 2026-06-18 | Last updated 2026-06-18

Summary

NVIDIA's ENPIRE project demonstrates fully automated robotic research in which AI agents, given only high-level goals, autonomously manage the entire loop on a fleet of physical robots, from literature search and code development to training, deployment, and hardware iteration. The system achieves 99% success rates on dexterous real-world tasks like cable tying and peg sorting. Key insights include the discovery of a "physical scaling law" where more parallel robots speed up task resolution, and the observation that resetting an environment is often easier than the main task. The framework introduces metrics like Mean Robot Utilization (MRU) to measure efficiency; in the reported experiments, robots were idle more than half the time, waiting for agent decisions. The long-term vision is a lab that runs autonomously, even without human oversight. The team plans to open-source the project, allowing developers to build similar systems.

Automated research has truly stepped out of the code sandbox and into the real physical world.

Recently, Jim Fan, lead of NVIDIA's GEAR lab, introduced a new project called ENPIRE. This marks their first implementation of automated research on robot hardware.

They placed 8 Codex Agents into a robot fleet, allocated GPU computing power and ample token budgets, and gave a simple goal: solve the task as quickly as possible, keep the robots busy but safe, and avoid wasting computing power.

After that, human intervention was largely withdrawn. The Agents autonomously drove the entire closed loop: automatic scene resetting, literature review, idea implementation and infrastructure setup, policy training and deployment, self-verification, log analysis, and code improvement. They iterated continuously until high-precision dexterous tasks, such as fastening cable ties, organizing pins into a box, and installing GPUs, were reliably completed on real hardware.

They also observed a "physical scaling law": increasing the number of parallel robots (e.g., from 1 to 8) significantly sped up task resolution.

Currently, part of the lab's systems can perform self-iteration overnight without human intervention, with researchers only needing to check reports in the morning.

Jim Fan stated that the future goal is to allow team members to go on vacation with peace of mind, with even NVIDIA CEO Jensen Huang unaware that the lab is still running autonomously.

The team plans to fully open-source ENPIRE, potentially enabling regular developers to set up similar autonomous robot research systems at home.

Project address: https://research.nvidia.com/labs/gear/enpire/

ENPIRE System Architecture: Four Modules Forming a Closed Loop

ENPIRE is a framework system designed for coding Agents, constructing a repeatable physical feedback loop through four core modules: the Environment module (EN) handles automatic resetting and verification; the Policy Improvement module (PI) initiates policy optimization; the Rollout module (R) supports policy evaluation on single or multiple robots in parallel; and the Evolution module (E) enables coding Agents to analyze logs, review literature, and improve training infrastructure and algorithm code to address failure modes.
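The way the four modules hand off to one another can be sketched as a toy simulation. This is a minimal illustration, not ENPIRE's code: every function name, the `LoopState` class, the scalar `skill` standing in for policy quality, and all numeric gains are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Hypothetical bookkeeping for one agent-managed task."""
    skill: float = 0.2  # toy stand-in for policy quality, in [0, 1]
    history: list = field(default_factory=list)  # success rate per iteration


def environment_reset_and_verify():
    """EN: reset the scene and confirm it is ready (always succeeds in this toy)."""
    return True


def policy_improvement(state):
    """PI: launch one round of policy optimization (toy: a small fixed gain)."""
    state.skill = min(1.0, state.skill + 0.05)


def rollout(state, rng, episodes=100):
    """R: evaluate the current policy and return the measured success rate."""
    successes = 0
    for _ in range(episodes):
        if environment_reset_and_verify() and rng.random() < state.skill:
            successes += 1
    return successes / episodes


def evolution(state, success_rate):
    """E: analyze results and revise the recipe (toy: extra gain when failures dominate)."""
    if success_rate < 0.5:
        state.skill = min(1.0, state.skill + 0.10)


def run_closed_loop(target=0.99, max_iters=50, seed=0):
    """Repeat PI -> R -> E until the measured success rate reaches the target."""
    rng = random.Random(seed)
    state = LoopState()
    for _ in range(max_iters):
        policy_improvement(state)
        rate = rollout(state, rng)
        state.history.append(rate)
        if rate >= target:
            break
        evolution(state, rate)
    return state


if __name__ == "__main__":
    final = run_closed_loop()
    print(f"iterations: {len(final.history)}, final success rate: {final.history[-1]:.2f}")
```

In a real system each stub would wrap robot hardware, a training job, or an agent session; the sketch only shows the control flow of reset, improve, evaluate, and revise.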

This closed-loop system transforms real-world robot learning into an Agent-managed, controllable optimization process, thereby minimizing manual input while supporting fair ablation experiments across different training recipes and Agent variants.

Supported by ENPIRE, frontier coding Agents can autonomously develop policies and achieve a 99% success rate on challenging real-world dexterous manipulation tasks like PushT, organizing pins into a pin box, and cutting cable ties with a cutter.

Key Finding: Resetting the Environment is Often Easier Than Completing the Task Itself

For many robot tasks, resetting the environment is easier than completing the task itself.

Therefore, ENPIRE's approach is to first let the Agent build an automatic resetting environment via Code-as-Policy. Often, the so-called reset is essentially a pick-and-place task, solvable by Cap-X.

Subsequently, the agent writes a heuristic rule-based reward function. The research team then places this environment into a sandbox and initiates automated research by the Agent centered around scoring.
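The article does not show any agent-written reward function, so the following is only a guess at what a heuristic, rule-based reward might look like for a planar task such as PushT. The function name, the `(x, y, theta)` pose format, and both tolerances are assumptions made for illustration.

```python
import math


def heuristic_reward(pose, goal, pos_tol=0.01, ang_tol=math.radians(5)):
    """Score a planar object pose against a goal pose.

    pose, goal: (x, y, theta) in metres and radians (hypothetical format).
    Returns (reward in (0, 1], success flag). Tolerances are illustrative.
    """
    pos_err = math.hypot(pose[0] - goal[0], pose[1] - goal[1])
    # Wrap the angle difference into [-pi, pi] before taking its magnitude.
    diff = pose[2] - goal[2]
    ang_err = abs(math.atan2(math.sin(diff), math.cos(diff)))
    success = pos_err <= pos_tol and ang_err <= ang_tol
    # Smooth score: 1.0 at the goal, decaying with position and angle error.
    reward = math.exp(-pos_err / pos_tol) * math.exp(-ang_err / ang_tol)
    return reward, success
```

A rule of this kind gives the Agent both a binary success signal for verification and a graded score to optimize against, without any learned components.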

This also echoes Karpathy's definition of automated research: automated research here is not simply tuning a hyperparameter or modifying a small piece of code. The Agent explores different paradigms from the internet and rewrites anything that could potentially boost performance, including algorithms, training objectives, and even the data loader.

In the pin task, one Agent even wrote its own contact force safety controller, which proved more effective than merely adjusting several reinforcement learning parameters.

New Metrics MRU and MTU

ENPIRE's scalability depends on the size of the Agent team and computing resources, but here, the truly scarce resource is not GPUs, but robot time.

When the research team provided Agents with 8 robots instead of 1, the time needed for the pin task to achieve near-perfect performance was reduced from over 1.5 hours to about 40 minutes. These Agents coordinate via Git: sharing code, discarding suboptimal ideas, and autonomously selecting each other's best-performing runs.
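As a rough worked example, the reported times can be turned into a speedup and a per-robot efficiency. The inputs are the article's rounded figures (90 and 40 minutes); because the single-robot time is given as "over 1.5 hours", the result is a lower bound. The function name is hypothetical.

```python
def speedup_and_efficiency(t_single_min, t_parallel_min, n_robots):
    """Return (speedup, parallel efficiency) for a task run on n_robots."""
    speedup = t_single_min / t_parallel_min
    return speedup, speedup / n_robots


speedup, efficiency = speedup_and_efficiency(90, 40, 8)
print(f"{speedup:.2f}x speedup, {efficiency:.0%} parallel efficiency")
# prints "2.25x speedup, 28% parallel efficiency"
```

On these figures, eight robots finish at least 2.25 times faster than one, which is a clear gain but well short of linear scaling.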

This points to a larger shift: robotics research is becoming an exercise in environment design, that is, building environments where coding Agents can conduct automated research. Algorithmic work is shifting to a higher layer, towards constructing feedback loops that Agents can autonomously close.

And this loop compounds continuously: a skill mastered by an Agent today becomes the building block for constructing and resetting environments for more difficult tasks tomorrow. Capabilities bootstrap new capabilities.

Under this paradigm, the true hard constraint is the budget for real-world interaction.

Therefore, the research team proposed two metrics:

  • Mean Robot Utilization (MRU): The proportion of time robots are actually running experiments relative to total elapsed real-world time.
  • Mean Token Utilization (MTU): Measures the efficiency of Agents in converting tokens into research progress.

In their experiments, MRU consistently remained below 50%. That means the robots were idle more than half the time, waiting for the Agents to think. Therefore, a better harness and faster models directly translate into tangible benefits.
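A minimal sketch of how MRU could be computed from logs, following the definition above (time spent running experiments divided by elapsed wall-clock time). The log format, the function name, and the choice to average across robots are assumptions; the article gives no formula. MTU is left out because the article does not define it precisely.

```python
def mean_robot_utilization(busy_intervals, wall_clock_s):
    """Average, over robots, of busy time divided by elapsed wall-clock time.

    busy_intervals: {robot_id: [(start_s, end_s), ...]}, a hypothetical log
    format with non-overlapping intervals per robot.
    """
    per_robot = []
    for intervals in busy_intervals.values():
        busy = sum(end - start for start, end in intervals)
        per_robot.append(busy / wall_clock_s)
    return sum(per_robot) / len(per_robot)


# Two robots observed for one hour (3600 s).
log = {
    "robot_a": [(0, 900), (1800, 2700)],  # busy 1800 s -> 0.50
    "robot_b": [(0, 900)],                # busy  900 s -> 0.25
}
mru = mean_robot_utilization(log, wall_clock_s=3600)
print(f"MRU = {mru:.3f}")  # prints "MRU = 0.375"
```

In this toy log the fleet is busy 37.5% of the time, so, as in the reported experiments, most robot time is spent waiting.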

PushT is a long-standing robot manipulation benchmark. Typically, completing this task requires extensive human demonstration data plus hours of behavior cloning training.

However, they observed that Codex, Claude Code, and Kimi Code all "solved" this task in under 2 hours using a rule-based heuristic approach: without neural networks, without training, and without relying on any human data.

To enable more people to experiment with automated research in the physical world at home, they developed a full-stack system based on the LeRobot SO-101 kit (@LeRobotHF) and NVIDIA Jetson Thor. This system can perform the PushT task.

Reference Links:

https://x.com/_wenlixiao/status/2066913334994358342

https://x.com/DrJimFan/status/2066921736369766762

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Yang Wen

Related Questions

Q: What is the key achievement of the ENPIRE project as described in the article?

A: The ENPIRE project by NVIDIA's GEAR lab has, for the first time, successfully implemented automated research on physical robot hardware. This involved 8 Codex Agents autonomously driving a closed-loop process to solve complex, high-precision dexterous manipulation tasks like tying zip ties and organizing a pin box, achieving a success rate of 99% with minimal human intervention.

Q: What are the four core modules that make up the ENPIRE system's closed-loop architecture?

A: The four core modules are: the Environment (EN) module for automatic resetting and validation, the Policy Improvement (PI) module for initiating policy optimization, the Rollout (R) module for evaluating policies on single or multiple robots, and the Evolution (E) module where coding agents analyze logs, consult literature, and improve infrastructure and code to address failure modes.

Q: What key insight regarding task difficulty did the researchers discover, and how did ENPIRE leverage it?

A: A key observation was that resetting a robot environment is often easier than completing the main task itself. ENPIRE leverages this by first having agents build an automatic reset environment (often a simple pick-and-place task) using Code-as-Policy. This automated reset capability is crucial for enabling continuous, unattended experimentation.

Q: According to the article, what new metrics were proposed to measure the efficiency of this automated research paradigm, and what does a low MRU indicate?

A: Two new metrics were proposed: Mean Robot Utilization (MRU), which measures the proportion of time robots are actively running experiments, and Mean Token Utilization (MTU), which measures how efficiently agent tokens are converted into research progress. In their experiments, MRU was below 50%, indicating robots spent over half their time idle, waiting for the agents to think, highlighting a bottleneck.

Q: What future vision did Jim Fan, leader of the NVIDIA GEAR lab, describe for the ENPIRE system?

A: Jim Fan's stated future goal is for the lab team to be able to go on vacation comfortably, with the system running so autonomously that even NVIDIA's CEO, Jensen Huang, would be unaware that the laboratory continues to operate and conduct research on its own.
