NVIDIA's New MoE Release: Add One 'import' Line, Fine-Tuning Runs 3.7x Faster

marsbit · Published 2026-06-26 · Last published 2026-06-26

Introduction

With a single added import line, NVIDIA's NeMo AutoModel speeds up MoE model fine-tuning by up to 3.7x and cuts GPU memory usage by 29%-32%. The solution is compatible with the Hugging Face Transformers v5 API, so existing code needs no significant changes. Its main techniques are Expert Parallelism (EP), which distributes expert weights across multiple GPUs; DeepEP, which fuses computation and communication; and TransformerEngine, which accelerates core operations. In tests on Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, training throughput rose 3.4-3.7x. For very large models such as Nemotron 3 Ultra 550B, the solution still runs without exhausting memory. Code and guides are available as open source on NVIDIA's GitHub.

One line of import, and fine-tuning large MoE models gets 3.7 times faster.

NVIDIA's latest research result is now open source: NeMo AutoModel, designed specifically for building and fine-tuning large-scale generative AI models.

Built on Hugging Face Transformers v5, NeMo AutoModel fine-tunes MoE models faster with just one added import line, without changes to existing code or APIs.

Experiments show that, compared with stock Hugging Face Transformers v5, NVIDIA NeMo AutoModel achieves 3.4-3.7x higher training throughput in MoE fine-tuning while reducing GPU memory usage by 29%-32%.

On a single node with 8x H100 80GB GPUs, taking Qwen3-30B-A3B as the example, NeMo AutoModel raises TPS/GPU (tokens per second per GPU) from 3075 to 11340, a 3.69x improvement.
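
The 3.69x figure follows directly from the two throughput numbers. A minimal check, using only the values quoted above (the function name is purely illustrative):

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Ratio of optimized to baseline per-GPU throughput."""
    return optimized_tps / baseline_tps


# TPS/GPU figures quoted in the article for Qwen3-30B-A3B on 8x H100 80GB.
qwen3_speedup = speedup(3075, 11340)
print(f"{qwen3_speedup:.2f}x")  # 3.69x
```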

Core Technology Analysis

MoE has become the mainstream architecture for current models, but it also brings new challenges for efficient training:

Expert Parallelism, communication fusion, kernel optimization... all of this complex engineering needs supporting infrastructure.

Hugging Face Transformers v5 is currently the widely used "common foundation" for MoE training. V5 improves native MoE support and introduces basic MoE capabilities such as expert backends, dynamic weight loading, and distributed execution.

This time, NVIDIA's approach is to build on that earlier work and stay compatible with the Hugging Face Transformers API, so that users change little of their code yet get higher training throughput and lower memory usage in MoE fine-tuning.

Specifically, NeMo AutoModel adds Expert Parallelism (EP), DeepEP, and TransformerEngine on top of Transformers v5.

Expert Parallelism

Expert Parallelism is used mainly to relieve memory pressure.

EP distributes the expert weights across multiple GPUs, so each GPU no longer stores all expert parameters, only a subset of them.

For example, on 8 GPUs with ep_size=8, the expert weights are spread across all 8 GPUs, and the memory taken by MoE expert weights on each GPU can drop to 1/8 of the original.
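
The 1/8 figure can be illustrated with a small placement sketch. This is not NeMo AutoModel's implementation: the contiguous split, the helper name `experts_for_rank`, and the expert count of 128 are all assumptions chosen for illustration; only `ep_size=8` comes from the text.

```python
def experts_for_rank(num_experts: int, ep_size: int, rank: int) -> list[int]:
    """Return the expert indices owned by one rank under a contiguous split."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    start = rank * per_rank
    return list(range(start, start + per_rank))


NUM_EXPERTS = 128  # assumed expert count, for illustration only
EP_SIZE = 8        # ep_size from the example in the text

# Map each rank (GPU) to the experts whose weights it would hold.
placement = {rank: experts_for_rank(NUM_EXPERTS, EP_SIZE, rank) for rank in range(EP_SIZE)}

# Each GPU holds 16 of 128 experts, i.e. 1/8 of the expert weights.
fraction_per_gpu = len(placement[0]) / NUM_EXPERTS
print(fraction_per_gpu)  # 0.125
```

Only the expert weights shrink by this factor; attention weights, activations, and optimizer state for non-expert parameters are not covered by this split, which is consistent with the measured peak-memory savings being smaller than 8x.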

In the experiments, for Qwen3 this lowers peak memory from 68.2 GiB to 48.1 GiB, a 29% reduction.

For the Nemotron 3 Nano model, memory usage falls from 62.1 GiB to 42.5 GiB, a 32% reduction.
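
Both percentages can be reproduced from the quoted peak-memory values. A short check (the helper name is illustrative):

```python
def reduction_pct(before_gib: float, after_gib: float) -> float:
    """Percentage drop in peak memory relative to the baseline."""
    return (before_gib - after_gib) / before_gib * 100


# Peak-memory figures quoted in the article (GiB).
qwen3_pct = reduction_pct(68.2, 48.1)
nano_pct = reduction_pct(62.1, 42.5)

print(f"Qwen3: {qwen3_pct:.1f}%")            # Qwen3: 29.5%
print(f"Nemotron 3 Nano: {nano_pct:.1f}%")   # Nemotron 3 Nano: 31.6%
```

Rounded to whole percentages these give the 29% and 32% reported above, which corresponds to roughly 20 GiB freed per GPU in each case.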

The freed memory can be used to support larger batch sizes or longer sequences.

DeepEP

DeepEP fuses computation with communication.

In the traditional approach, there is a clear communication cost between token dispatch and expert computation. DeepEP integrates the token dispatch and combine operations into optimized GPU kernels, so that communication overlaps with expert computation.
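
To make "dispatch" and "combine" concrete, here is a single-process sketch of the data movement only. It does not use DeepEP's API and shows neither the cross-GPU exchange nor the overlap with computation; the toy experts, the top-1 routing, and the routing decisions are all made up for illustration.

```python
def dispatch(tokens, expert_ids, num_experts):
    """Group tokens by destination expert, remembering original positions."""
    buckets = [[] for _ in range(num_experts)]
    for pos, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append((pos, tok))
    return buckets


def combine(expert_outputs, num_tokens):
    """Scatter expert outputs back into the original token order."""
    out = [None] * num_tokens
    for bucket in expert_outputs:
        for pos, value in bucket:
            out[pos] = value
    return out


# Toy "experts": expert e multiplies its input by (e + 1).
experts = [lambda x, scale=e + 1: x * scale for e in range(4)]

tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
expert_ids = [2, 0, 2, 3, 1]  # hypothetical top-1 routing decision per token

buckets = dispatch(tokens, expert_ids, num_experts=4)
processed = [
    [(pos, experts[e](tok)) for pos, tok in bucket]
    for e, bucket in enumerate(buckets)
]
result = combine(processed, len(tokens))
print(result)  # [3.0, 2.0, 9.0, 16.0, 10.0]
```

With expert parallelism, each bucket would live on a different GPU, so dispatch and combine become communication steps; that is the cost the article says DeepEP hides behind expert computation.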

TransformerEngine

TransformerEngine kernels accelerate a range of core operations.

It provides fused implementations of the attention mechanism, linear layers, and RMSNorm, speeding up not only the MoE layers but also the ordinary Transformer layers.
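
For readers unfamiliar with RMSNorm, one of the operations named above, this is a plain, unfused reference version in pure Python. It is not TransformerEngine's API; it only shows the arithmetic that a fused kernel would perform in a single pass on the GPU. The `eps` default is an assumed typical value.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


hidden = [3.0, 4.0]
normed = rms_norm(hidden, weight=[1.0, 1.0])
print([f"{v:.4f}" for v in normed])  # ['0.8485', '1.1314']
```

Written this way, the mean-square reduction, the scaling, and the weight multiply are separate passes over the data; fusing them avoids writing intermediate results back to memory between steps.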

One 'import' Line, a 3x Speedup

In short, for users already on Transformers v5, NVIDIA NeMo AutoModel offers a painless upgrade path:

Add a single import line and MoE fine-tuning runs more than 3 times faster.

On Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, compared with Transformers v5, the solution achieves 3.4-3.7x higher training throughput while cutting memory consumption by 29%-32%.

NVIDIA also showed full-parameter fine-tuning results for Nemotron 3 Ultra 550B A55B on 16 H100 nodes with 128 GPUs in total.

TPS/GPU is 815, TFLOP/s/GPU is about 293, and peak memory is 58.2 GiB.
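
As a rough sanity check on these numbers, the sketch below derives the aggregate throughput and a back-of-the-envelope compute estimate. Two assumptions are not from the article: that "A55B" means about 55B active parameters per token, and the common rule of thumb of about 6 FLOPs per active parameter per token for a forward plus backward pass, which ignores attention FLOPs.

```python
TPS_PER_GPU = 815     # tokens per second per GPU, from the article
NUM_GPUS = 128        # 16 nodes x 8 H100, from the article
ACTIVE_PARAMS = 55e9  # assumption: "A55B" read as ~55B active parameters

# Aggregate tokens per second across the whole cluster.
cluster_tps = TPS_PER_GPU * NUM_GPUS

# Rule-of-thumb training cost: ~6 FLOPs per active parameter per token.
approx_tflops_per_gpu = 6 * ACTIVE_PARAMS * TPS_PER_GPU / 1e12

print(cluster_tps)                     # 104320
print(f"{approx_tflops_per_gpu:.0f}")  # 269
```

The estimate of about 269 TFLOP/s per GPU is in the same range as the reported figure of about 293; a gap of this size is plausible given that the rule of thumb leaves out attention computation.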

There is no comparison against v5 here because Transformers v5 runs out of memory outright at this scale ¯\_(ツ)_/¯

For those interested, NVIDIA has published the code, configurations, and benchmark scripts on GitHub: https://github.com/NVIDIA-NeMo/Automodel/tree/blog/transformers-v5-automodel/blog_experiments

The usage guide is here: https://docs.nvidia.com/nemo/automodel/latest/get-started/hf-compatibility

This article comes from the WeChat public account "Qubit", author: Yu Yang


Related Questions

Q: What is the main benefit of using NVIDIA's NeMo AutoModel for fine-tuning MoE models?

A: The main benefit is 3.4-3.7x faster fine-tuning and a 29%-32% reduction in GPU memory usage, obtained by adding a single `import` line without changing existing code.

Q: Which core technologies does NeMo AutoModel add on top of Transformers v5 to achieve these gains?

A: NeMo AutoModel adds three core technologies: Expert Parallelism (EP), which distributes expert weights across multiple GPUs; DeepEP, which fuses computation and communication; and TransformerEngine, which provides accelerated kernels for core operations such as the attention mechanism.

Q: How does Expert Parallelism (EP) in NeMo AutoModel help save GPU memory?

A: Expert Parallelism distributes the expert weights of an MoE model across multiple GPUs. With 8 GPUs, for example, each GPU stores only 1/8 of the total expert parameters, which significantly reduces the memory load per GPU, as shown by the drop from 68.2 GiB to 48.1 GiB for Qwen3.

Q: Which models does the article test to demonstrate NeMo AutoModel's performance gains?

A: The article reports fine-tuning gains on Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B. It also shows full-parameter fine-tuning results for the large-scale Nemotron 3 Ultra 550B A55B on 128 H100 GPUs.

Q: Where can the code, configurations, and usage guide for NeMo AutoModel be found?

A: The code, configurations, and benchmark scripts for NeMo AutoModel are in NVIDIA's GitHub repository: https://github.com/NVIDIA-NeMo/Automodel/tree/blog/transformers-v5-automodel/blog_experiments. The full usage guide is at: https://docs.nvidia.com/nemo/automodel/latest/get-started/hf-compatibility.
