NVIDIA's New Open-Source MoE: One Line of Import, Fine-Tuning Accelerated by 3.7x

marsbit · Published 2026-06-26 · Last updated 2026-06-26

Summary

NVIDIA has open-sourced NeMo AutoModel, a tool designed to significantly accelerate the fine-tuning of Mixture-of-Experts (MoE) large language models. By adding just one import line to existing code based on Hugging Face Transformers v5, users can achieve a 3.4x to 3.7x increase in training throughput and reduce GPU memory usage by 29% to 32% without altering their API. The key innovations include Expert Parallelism (EP) to distribute expert weights across GPUs, lowering memory pressure; DeepEP to fuse computation and communication; and TransformerEngine kernels for accelerated core operations. Benchmarks on models like Qwen3-30B-A3B show training throughput per GPU jumping from 3075 to 11340 tokens per second. The solution also enables the fine-tuning of very large models, such as the 550B parameter Nemotron 3 Ultra, which would exceed memory limits with the standard Transformers v5. Code and benchmarks are available on GitHub.

One line of import, fine-tuning of MoE large models accelerated by 3.7x.

NVIDIA's latest research is now open source: NeMo AutoModel, designed specifically for large-scale building and fine-tuning of generative AI models.

Built on top of Hugging Face Transformers v5, NeMo AutoModel speeds up MoE fine-tuning without any change to the code API: users only add one import line.

Experiments show that, compared to the original Hugging Face Transformers v5, NVIDIA's NeMo AutoModel can achieve a 3.4-3.7x increase in training throughput and reduce GPU memory usage by 29%-32% during MoE fine-tuning.

On a single node with 8x H100 80GB GPUs, using Qwen3-30B-A3B as an example, NeMo AutoModel directly increased the TPS/GPU (tokens per second per GPU) from 3075 to 11340, achieving a 3.69x improvement.
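
The headline figure can be checked with a few lines of arithmetic. This is only a sketch using the two TPS/GPU values quoted above; the variable and function names are illustrative, not part of any NVIDIA tooling.

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return optimized_tps / baseline_tps

baseline_tps_per_gpu = 3075    # Transformers v5, as quoted in the article
automodel_tps_per_gpu = 11340  # NeMo AutoModel, as quoted in the article
num_gpus = 8                   # one node of 8x H100

ratio = speedup(baseline_tps_per_gpu, automodel_tps_per_gpu)
node_tps = automodel_tps_per_gpu * num_gpus  # aggregate tokens/s for the node

print(f"speedup: {ratio:.2f}x")                   # speedup: 3.69x
print(f"node throughput: {node_tps} tokens/s")    # node throughput: 90720 tokens/s
```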

Core Technology Explained

MoE has become the mainstream architecture for cutting-edge models, but MoE also introduces new challenges for efficient training:

Expert parallelism, communication fusion, kernel optimization... these complex engineering tasks require supporting infrastructure.

Hugging Face Transformers v5 is currently a widely used general-purpose base for MoE training. v5 strengthened native MoE support, introducing foundational capabilities such as expert backends, dynamic weight loading, and distributed execution.

This time, NVIDIA's approach is to stand on the shoulders of that work: NeMo AutoModel stays compatible with the Hugging Face Transformers API, so users get higher training throughput and lower memory usage in MoE fine-tuning without significant code changes.

Specifically, NeMo AutoModel adds Expert Parallelism (EP), DeepEP, and TransformerEngine on top of Transformers v5.

Expert Parallelism

Expert Parallelism technology is primarily used to reduce memory pressure.

EP distributes expert weights across multiple GPUs; each GPU no longer holds all expert parameters entirely, but only holds a portion of them.

For example, with ep_size=8 on 8 GPUs, the expert weights are split eight ways, cutting the expert-weight memory on each GPU to 1/8 of the original.
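
As a rough illustration of that split, the sketch below assigns experts to ranks in contiguous blocks. The expert count of 128 and the contiguous layout are assumptions made for the example, and `experts_for_rank` is a hypothetical helper, not a NeMo AutoModel function.

```python
def experts_for_rank(num_experts: int, ep_size: int, rank: int) -> range:
    """Return the expert indices owned by `rank` under a contiguous split."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)

NUM_EXPERTS = 128  # assumption for illustration
EP_SIZE = 8        # matches the ep_size=8 example in the text

# Map every rank to the experts whose weights it would hold.
placement = {rank: list(experts_for_rank(NUM_EXPERTS, EP_SIZE, rank))
             for rank in range(EP_SIZE)}

fraction_per_gpu = len(placement[0]) / NUM_EXPERTS
print(len(placement[0]), fraction_per_gpu)  # 16 0.125
```

Only the expert weights shrink by this factor; everything else a GPU holds is unaffected by this split.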

Experimental results show that for Qwen3-30B-A3B, this technique reduces peak memory from 68.2 GiB to 48.1 GiB, a 29% reduction.

For the Nemotron 3 Nano 30B-A3B model, memory usage dropped from 62.1 GiB to 42.5 GiB, a 32% reduction.
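
The two percentages follow directly from the quoted peak-memory figures. A minimal check, with a hypothetical helper name:

```python
def reduction_pct(before_gib: float, after_gib: float) -> float:
    """Percentage by which peak memory dropped."""
    return (before_gib - after_gib) / before_gib * 100

# (before, after) peak memory in GiB, as quoted in the article
measurements = {
    "first model": (68.2, 48.1),
    "second model": (62.1, 42.5),
}

for name, (before, after) in measurements.items():
    freed = before - after
    print(f"{name}: {reduction_pct(before, after):.0f}% lower, {freed:.1f} GiB freed")
```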

The freed-up space can be used to support larger batch sizes and longer sequences.

DeepEP

DeepEP fuses computation with communication.

In the traditional approach, dispatching tokens to experts and running the expert computation happen as separate steps, with significant communication cost in between. DeepEP integrates the token dispatch and combine operations into optimized GPU kernels, overlapping the communication with expert computation.
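
To make the two routing steps concrete, here is a plain-Python sketch of the logic only: tokens are grouped by the rank that owns their chosen expert (dispatch), and the weighted expert outputs are summed back per token (combine). It does not model DeepEP's kernels or any overlap with computation; all names, the toy expert, and the routing table are assumptions for illustration.

```python
from collections import defaultdict

def owner_rank(expert_id: int, experts_per_rank: int) -> int:
    """Rank holding `expert_id` under a contiguous expert split."""
    return expert_id // experts_per_rank

def dispatch(routes, experts_per_rank):
    """Group (token_idx, expert_id, weight) entries by destination rank."""
    buckets = defaultdict(list)
    for token_idx, choices in enumerate(routes):
        for expert_id, weight in choices:
            rank = owner_rank(expert_id, experts_per_rank)
            buckets[rank].append((token_idx, expert_id, weight))
    return dict(buckets)

def toy_expert(expert_id: int, x: float) -> float:
    """Stand-in for an expert MLP: scale the input by (expert_id + 1)."""
    return (expert_id + 1) * x

def combine(tokens, buckets):
    """Sum the weighted expert outputs back into the original token order."""
    out = [0.0] * len(tokens)
    for rank in sorted(buckets):
        for token_idx, expert_id, weight in buckets[rank]:
            out[token_idx] += weight * toy_expert(expert_id, tokens[token_idx])
    return out

tokens = [1.0, 2.0, 3.0]  # one scalar per token, for brevity
routes = [[(0, 0.5), (3, 0.5)], [(1, 1.0)], [(2, 0.25), (3, 0.75)]]

buckets = dispatch(routes, experts_per_rank=2)  # 4 experts over 2 ranks
output = combine(tokens, buckets)
print(output)  # [2.5, 4.0, 11.25]
```

In a real multi-GPU run each bucket would travel over the interconnect, which is the traffic the fused kernels are meant to hide behind expert computation.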

TransformerEngine

TransformerEngine kernels accelerate a range of core operations.

The library provides fused implementations of attention, linear layers, RMSNorm, and more, accelerating not only MoE layers but also regular Transformer layers.
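
For readers unfamiliar with RMSNorm, one of the operations named above, the following is an unfused reference version in plain Python. It shows the math only and is not TransformerEngine's implementation; a fused kernel would compute the same result in a single GPU pass. The `eps` default is an assumed value.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

hidden = [3.0, 4.0]
gain = [1.0, 1.0]  # learned per-channel weight; ones leave the scale unchanged

normalized = rms_norm(hidden, gain)
# After normalization the mean of squares is approximately 1.
mean_sq_after = sum(v * v for v in normalized) / len(normalized)
print(normalized, mean_sq_after)
```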

One Line of Import, 3x Speed Boost

In summary, for those already using Transformers v5, NVIDIA's NeMo AutoModel offers a seamless upgrade path:

Just add one import line to speed up MoE fine-tuning by more than 3x.

On Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, compared to Transformers v5, this solution achieves a 3.4-3.7x increase in training throughput while reducing memory consumption by 29%-32%.

NVIDIA also demonstrated the results of full-parameter fine-tuning for Nemotron 3 Ultra 550B A55B on 16 H100 nodes with 128 GPUs.

The TPS/GPU was 815, TFLOP/s/GPU was about 293, and peak memory was 58.2 GiB.
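
A back-of-the-envelope sketch relates these figures. It reads "A55B" as 55B active parameters and uses the common rule of thumb of about 6 FLOPs per active parameter per training token; both are assumptions, not details given in the article.

```python
NODES = 16
GPUS_PER_NODE = 8
TPS_PER_GPU = 815              # as quoted in the article
ACTIVE_PARAMS = 55e9           # assumption: "A55B" read as 55B active parameters
FLOPS_PER_PARAM_PER_TOKEN = 6  # rule of thumb for forward + backward passes

total_gpus = NODES * GPUS_PER_NODE
cluster_tps = TPS_PER_GPU * total_gpus
approx_tflops_per_gpu = (
    FLOPS_PER_PARAM_PER_TOKEN * ACTIVE_PARAMS * TPS_PER_GPU / 1e12
)

print(total_gpus)                      # 128
print(cluster_tps)                     # 104320
print(f"{approx_tflops_per_gpu:.0f}")  # 269
```

The estimate lands somewhat below the reported figure of about 293 TFLOP/s/GPU. The rule of thumb leaves out attention-score FLOPs, and the article does not state how the benchmark counts FLOPs, so the gap should not be over-interpreted.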

The reason for not comparing it with v5 here is that Transformers v5 would simply run out of memory at this scale ¯\_(ツ)_/¯

If you're interested, NVIDIA has already placed the code, configurations, and benchmark scripts on GitHub: https://github.com/NVIDIA-NeMo/Automodel/tree/blog/transformers-v5-automodel/blog_experiments

The detailed usage guide is here: https://docs.nvidia.com/nemo/automodel/latest/get-started/hf-compatibility

This article is from the WeChat public account "Qubit," author: Yu Yang


Related Questions

Q: What is the key benefit of NVIDIA's newly open-sourced NeMo AutoModel for MoE model fine-tuning?

A: The key benefit is a significant performance improvement. NeMo AutoModel enables a 3.4-3.7x increase in training throughput and reduces GPU memory usage by 29-32% for MoE model fine-tuning, compared to the standard Hugging Face Transformers v5.

Q: How does NeMo AutoModel achieve compatibility with existing Hugging Face Transformers v5 code?

A: NeMo AutoModel achieves compatibility by maintaining the same API as Hugging Face Transformers v5. Users can integrate it into their existing code with minimal changes, often requiring only the addition of a single import statement to gain the performance benefits.

Q: What is Expert Parallelism (EP), and what problem does it solve?

A: Expert Parallelism (EP) is a core technique in NeMo AutoModel that distributes the expert weights of a Mixture-of-Experts (MoE) model across multiple GPUs. This reduces the memory pressure on each individual GPU, as it no longer needs to hold all expert parameters. For example, on 8 GPUs, it can reduce MoE memory usage per GPU to about 1/8 of the original.

Q: What role does DeepEP play in the NeMo AutoModel architecture?

A: DeepEP fuses computation with communication. It optimizes performance by integrating the token routing (distribution and combination) operations into optimized GPU kernels. This allows the communication process to overlap with expert computation, reducing traditional communication overhead and improving overall training efficiency.

Q: What results were demonstrated for the large-scale Nemotron 3 Ultra 550B model fine-tuning with NeMo AutoModel?

A: For the Nemotron 3 Ultra 550B A55B model fine-tuning on 128 H100 GPUs, NeMo AutoModel achieved a throughput of 815 tokens per second per GPU (TPS/GPU) and a peak memory usage of 58.2 GiB. The article notes that Transformers v5 could not handle this scale, as it would run out of memory.
