NVIDIA's New Open-Source MoE: One Line of Import, Fine-Tuning Accelerated by 3.7x

marsbit | Published on 2026-06-26 | Last updated on 2026-06-26

Abstract

NVIDIA has open-sourced NeMo AutoModel, a tool designed to significantly accelerate the fine-tuning of Mixture-of-Experts (MoE) large language models. By adding just one import line to existing code based on Hugging Face Transformers v5, users can achieve a 3.4x to 3.7x increase in training throughput and reduce GPU memory usage by 29% to 32% without altering their API. The key innovations include Expert Parallelism (EP) to distribute expert weights across GPUs, lowering memory pressure; DeepEP to fuse computation and communication; and TransformerEngine kernels for accelerated core operations. Benchmarks on models like Qwen3-30B-A3B show training throughput per GPU jumping from 3075 to 11340 tokens per second. The solution also enables the fine-tuning of very large models, such as the 550B parameter Nemotron 3 Ultra, which would exceed memory limits with the standard Transformers v5. Code and benchmarks are available on GitHub.

One added import line, and fine-tuning of large MoE models runs up to 3.7x faster.

NVIDIA's latest research is now open source: NeMo AutoModel, designed specifically for large-scale building and fine-tuning of generative AI models.

Built on top of Hugging Face Transformers v5, NeMo AutoModel speeds up fine-tuning of MoE models without changing the API: users only add a single import line.

Experiments show that, compared to the original Hugging Face Transformers v5, NVIDIA's NeMo AutoModel can achieve a 3.4-3.7x increase in training throughput and reduce GPU memory usage by 29%-32% during MoE fine-tuning.

On a single node with 8x H100 80GB GPUs, using Qwen3-30B-A3B as an example, NeMo AutoModel directly increased the TPS/GPU (tokens per second per GPU) from 3075 to 11340, achieving a 3.69x improvement.
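The speedup figure can be checked directly from the two throughput numbers quoted above. The short sketch below only redoes that arithmetic; the aggregate node throughput is derived here by multiplying by the 8 GPUs and is not a figure reported in the article.

```python
# Figures quoted in the article for Qwen3-30B-A3B on one node of 8x H100 80GB.
baseline_tps_per_gpu = 3075    # Hugging Face Transformers v5
automodel_tps_per_gpu = 11340  # NeMo AutoModel

# Ratio of the two per-GPU throughputs.
speedup = automodel_tps_per_gpu / baseline_tps_per_gpu

# Derived value (not reported in the article): tokens/s summed over 8 GPUs.
node_tps = automodel_tps_per_gpu * 8

print(f"speedup: {speedup:.2f}x")
print(f"node throughput: {node_tps} tokens/s")
```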

Core Technology Explained

MoE has become the mainstream architecture for frontier models, but it also makes efficient training harder:

Expert parallelism, communication fusion, and kernel optimization are all complex engineering tasks that require supporting infrastructure.

Hugging Face's Transformers v5 is currently a widely used "universal base" for MoE training. Version 5 strengthened native MoE support, introducing foundational capabilities such as expert backends, dynamic weight loading, and distributed execution.

This time, NVIDIA's approach is to stand on the shoulders of that work: it stays compatible with the Hugging Face Transformers API, so users get higher training throughput and lower memory usage in MoE fine-tuning without significant code changes.

Specifically, NeMo AutoModel adds Expert Parallelism (EP), DeepEP, and TransformerEngine on top of Transformers v5.

Expert Parallelism

Expert Parallelism technology is primarily used to reduce memory pressure.

EP distributes expert weights across multiple GPUs; each GPU no longer holds all expert parameters entirely, but only holds a portion of them.

For example, with ep_size=8 across 8 GPUs, expert weights are distributed across 8 GPUs, reducing the MoE memory footprint on each GPU to 1/8 of the original.
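To make the 1/8 figure concrete, here is a minimal sketch of how experts might be split across ranks. The contiguous layout, the helper name `experts_on_rank`, and the count of 128 experts per layer are assumptions chosen for illustration; the article does not describe how NeMo AutoModel actually assigns experts to GPUs.

```python
def experts_on_rank(num_experts: int, ep_size: int, rank: int) -> list[int]:
    """Return the expert indices owned by `rank` under a contiguous split.

    Hypothetical helper for illustration only.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    start = rank * per_rank
    return list(range(start, start + per_rank))


NUM_EXPERTS = 128  # assumed number of experts per MoE layer (illustrative)
EP_SIZE = 8        # ep_size from the article's example

# One shard of expert indices per GPU rank.
shards = [experts_on_rank(NUM_EXPERTS, EP_SIZE, r) for r in range(EP_SIZE)]

# Fraction of the layer's expert weights that a single GPU holds.
fraction_per_gpu = len(shards[0]) / NUM_EXPERTS

print(f"rank 0 owns experts {shards[0][0]}..{shards[0][-1]}")
print(f"each GPU holds {fraction_per_gpu:.3f} of the expert weights")
```

Under these assumptions each GPU keeps 16 of the 128 experts, which is where the 1/8 share of expert-weight memory comes from.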

Experimental results show that for Qwen3, this technology can reduce peak memory from 68.2 GiB to 48.1 GiB, a 29% reduction.

For the Nemotron 3 Nano model, memory usage dropped from 62.1 GiB to 42.5 GiB, a 32% reduction.
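The two percentages follow from the quoted peak-memory figures. The sketch below recomputes them; the helper name `reduction_pct` is made up for this example, and the model labels are taken from the benchmark summary later in the article.

```python
def reduction_pct(before_gib: float, after_gib: float) -> float:
    """Percentage drop in peak memory going from `before_gib` to `after_gib`."""
    return 100.0 * (before_gib - after_gib) / before_gib


# (peak GiB with Transformers v5, peak GiB with NeMo AutoModel), as quoted above.
peak_memory = {
    "Qwen3-30B-A3B": (68.2, 48.1),
    "Nemotron 3 Nano 30B-A3B": (62.1, 42.5),
}

for model, (before, after) in peak_memory.items():
    freed = before - after
    print(f"{model}: {reduction_pct(before, after):.1f}% lower, {freed:.1f} GiB freed")
```

Note that these are total peak-memory numbers per GPU, so the drop is smaller than the 1/8 ratio that applies to expert weights alone.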

The freed-up space can be used to support larger batch sizes and longer sequences.

DeepEP

DeepEP achieves the fusion of computation and communication.

In the traditional approach, routing tokens to their experts and gathering the results back incurs significant communication cost around the expert computation. DeepEP moves the token dispatch and combine operations into optimized GPU kernels, so that communication overlaps with expert computation.
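For readers unfamiliar with the two routing steps, the following is a plain-Python sketch of what dispatch and combine do logically. It is not DeepEP's API and shows no fusion or overlap; the function names, toy experts, and gate weights are all invented for illustration.

```python
from collections import defaultdict


def dispatch(tokens, routes):
    """Group (token_index, value, gate_weight) entries by destination expert."""
    buckets = defaultdict(list)
    for idx, (value, route) in enumerate(zip(tokens, routes)):
        for expert_id, weight in route:
            buckets[expert_id].append((idx, value, weight))
    return buckets


def run_expert(expert_id, batch):
    """Toy expert: expert e multiplies its input by (e + 1)."""
    return [(idx, value * (expert_id + 1), weight) for idx, value, weight in batch]


def combine(num_tokens, expert_outputs):
    """Sum gate-weighted expert outputs back into the original token order."""
    out = [0.0] * num_tokens
    for results in expert_outputs.values():
        for idx, value, weight in results:
            out[idx] += weight * value
    return out


tokens = [1.0, 2.0, 3.0]
# Each token is routed to two experts as (expert_id, gate_weight) pairs.
routes = [[(0, 0.5), (1, 0.5)], [(1, 1.0), (2, 0.0)], [(0, 0.25), (2, 0.75)]]

buckets = dispatch(tokens, routes)
expert_outputs = {e: run_expert(e, batch) for e, batch in buckets.items()}
combined = combine(len(tokens), expert_outputs)
print(combined)
```

In a multi-GPU run the experts live on different devices, so dispatch and combine become all-to-all exchanges; that exchange is the communication step the article says DeepEP overlaps with expert computation.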

TransformerEngine

TransformerEngine kernels accelerate a range of core operations.

They provide optimized implementations of fused attention, linear layers, RMSNorm, and other operations, accelerating not only MoE layers but also regular Transformer layers.
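As a reference point for one of the operations named above, here is an unfused, pure-Python version of RMSNorm. It only shows the math that an optimized kernel would compute; it is not TransformerEngine code, and the default `eps` value is an assumption.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale `x` by the reciprocal of its root mean square, then by `weight`."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


x = [3.0, 4.0]
# eps=0.0 and unit weights so the output has a root mean square of exactly 1.
out = rms_norm(x, [1.0, 1.0], eps=0.0)
print(out)
```

A fused kernel performs the reduction and the scaling in a single pass over the data instead of the separate steps written out here.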

One Line of Import, 3x Speed Boost

In summary, for those already using Transformers v5, NVIDIA's NeMo AutoModel offers a seamless upgrade path:

Just add one import line to make MoE fine-tuning more than 3x faster.

On Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, compared to Transformers v5, this solution achieves a 3.4-3.7x increase in training throughput while reducing memory consumption by 29%-32%.

NVIDIA also demonstrated full-parameter fine-tuning of Nemotron 3 Ultra 550B A55B on 16 H100 nodes (128 GPUs in total).

The TPS/GPU was 815, TFLOP/s/GPU was about 293, and peak memory was 58.2 GiB.
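These per-GPU numbers can be put in context with two quick calculations. The sketch below derives the cluster-wide throughput and a rough compute estimate using the common rule of thumb of about 6 FLOPs per active parameter per training token. Reading "A55B" as 55 billion active parameters is an assumption based on the naming pattern, and the rule of thumb ignores attention FLOPs, so the estimate is expected to fall somewhat below the reported figure of about 293.

```python
tps_per_gpu = 815      # reported tokens per second per GPU
num_gpus = 128         # 16 nodes x 8 H100 GPUs
active_params = 55e9   # assumption: "A55B" read as 55B active parameters per token

# Derived value (not reported in the article): tokens/s across the whole cluster.
cluster_tps = tps_per_gpu * num_gpus

# Rule-of-thumb training cost: ~6 FLOPs per active parameter per token
# (forward plus backward), excluding attention FLOPs.
approx_tflops_per_gpu = 6 * active_params * tps_per_gpu / 1e12

print(f"cluster throughput: {cluster_tps} tokens/s")
print(f"approx. compute: {approx_tflops_per_gpu:.0f} TFLOP/s per GPU")
```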

The reason for not comparing it with v5 here is that Transformers v5 would simply run out of memory at this scale ¯\_(ツ)_/¯

If you're interested, NVIDIA has already placed the code, configurations, and benchmark scripts on GitHub: https://github.com/NVIDIA-NeMo/Automodel/tree/blog/transformers-v5-automodel/blog_experiments

The detailed usage guide is here: https://docs.nvidia.com/nemo/automodel/latest/get-started/hf-compatibility

This article is from the WeChat public account "Qubit," author: Yu Yang

Related Questions

Q: What is the key benefit of NVIDIA's newly open-sourced NeMo AutoModel for MoE model fine-tuning?

A: The key benefit is a significant performance improvement. NeMo AutoModel enables a 3.4-3.7x increase in training throughput and reduces GPU memory usage by 29-32% for MoE model fine-tuning, compared to the standard Hugging Face Transformers v5.

Q: How does NeMo AutoModel achieve compatibility with existing Hugging Face Transformers v5 code?

A: NeMo AutoModel achieves compatibility by maintaining the same API as Hugging Face Transformers v5. Users can integrate it into their existing code with minimal changes, often requiring only the addition of a single import statement to gain the performance benefits.

Q: What is Expert Parallelism (EP), and what problem does it solve?

A: Expert Parallelism (EP) is a core technique in NeMo AutoModel that distributes the expert weights of a Mixture-of-Experts (MoE) model across multiple GPUs. This reduces the memory pressure on each individual GPU, as it no longer needs to hold all expert parameters. For example, on 8 GPUs, it can reduce MoE memory usage per GPU to about 1/8th of the original.

Q: What role does DeepEP play in the NeMo AutoModel architecture?

A: DeepEP fuses computation with communication. It optimizes performance by integrating the token routing (distribution and combination) operations into optimized GPU kernels. This allows the communication process to overlap with expert computation, reducing traditional communication overhead and improving overall training efficiency.

Q: What results were demonstrated for the large-scale Nemotron 3 Ultra 550B model fine-tuning with NeMo AutoModel?

A: For the Nemotron 3 Ultra 550B A55B model fine-tuning on 128 H100 GPUs, NeMo AutoModel achieved a throughput of 815 tokens per second per GPU (TPS/GPU) and a peak memory usage of 58.2 GiB. The article notes that Transformers v5 could not handle this scale, as it would run out of memory.
