Articles Related to MoE

The HTX News Center offers the latest articles and in-depth analysis on "MoE", covering market trends, project updates, technological developments, and regulatory policy in the crypto industry.

NVIDIA's New Open-Source MoE: One Line of Import, Fine-Tuning Accelerated by 3.7x

NVIDIA has open-sourced NeMo AutoModel, a tool designed to significantly accelerate the fine-tuning of Mixture-of-Experts (MoE) large language models. By adding just one import line to existing code based on Hugging Face Transformers v5, users can achieve a 3.4x to 3.7x increase in training throughput and reduce GPU memory usage by 29% to 32% without altering their API. The key innovations include Expert Parallelism (EP) to distribute expert weights across GPUs, lowering memory pressure; DeepEP to fuse computation and communication; and TransformerEngine kernels for accelerated core operations. Benchmarks on models like Qwen3-30B-A3B show training throughput per GPU jumping from 3075 to 11340 tokens per second. The solution also enables the fine-tuning of very large models, such as the 550B parameter Nemotron 3 Ultra, which would exceed memory limits with the standard Transformers v5. Code and benchmarks are available on GitHub.
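The headline figure can be checked against the per-GPU numbers quoted in the summary. The short sketch below does only that arithmetic; the 80 GB baseline memory figure is a hypothetical value chosen for illustration and does not come from the article.

```python
# Per-GPU training throughput for Qwen3-30B-A3B, as quoted in the summary.
baseline_tps = 3075       # tokens per second before
accelerated_tps = 11340   # tokens per second after

speedup = accelerated_tps / baseline_tps
print(f"per-GPU speedup: {speedup:.2f}x")

# Memory reduction range quoted in the summary (29% to 32%).
# ASSUMPTION: the 80 GB baseline is hypothetical, for illustration only.
hypothetical_baseline_gb = 80.0
memory_after_gb = [hypothetical_baseline_gb * (1 - r) for r in (0.29, 0.32)]
print("memory after reduction (GB):", [round(m, 1) for m in memory_after_gb])
```

The ratio rounds to the 3.7x quoted in the headline, which sits at the top of the 3.4x to 3.7x range given in the summary.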

marsbit · 06/26 07:29

After 10 Years, Altman Finally Has the Person He Wanted

After a decade of waiting, OpenAI CEO Sam Altman has finally secured his desired collaborator: Noam Shazeer, a legendary AI researcher and co-author of the seminal "Attention Is All You Need" paper that introduced the Transformer architecture. Shazeer has announced his departure from Google to join OpenAI as Head of Architectural Research. A key early Google employee who returned to Google DeepMind in a high-profile $2.7 billion deal two years ago, Shazeer confirmed his move on the social media platform X. Altman expressed his long-standing desire to work with Shazeer, stating the 10-year wait would be worth it. OpenAI's research lead, Mark Chen, welcomed Shazeer, highlighting his foundational work on the Transformer, Mixture-of-Experts (MoE) models, and efficient decoding, all of which have profoundly shaped modern AI. His departure is seen as a significant blow to Google's Gemini project, where he served as a technical co-lead. Industry observers note that the move is a major win for OpenAI in the ongoing AI talent war, with some quipping that OpenAI acquired his expertise "for free" after Google's massive investment.

marsbit · 06/18 04:15

Running MoE on Mobile Phones? Meta Proposes MobileMoE, Speeding Up iPhone 16 Pro by 3.8x

Meta's MobileMoE, a mobile-optimized Mixture-of-Experts (MoE) language model architecture, enables, for the first time, efficient on-device large language model (LLM) inference on commercial smartphones. Designed for decoder-only Transformers, it replaces dense feed-forward layers with MoE layers. Key design choices include 8 experts with granularity g=8, top-4 routing, and a shared expert. The model undergoes a four-stage training process: pre-training, intermediate training, supervised fine-tuning, and quantization-aware training. At a similar memory footprint, MobileMoE models match or exceed the average accuracy of dense baselines across 14 foundational benchmarks while using only 1/2 to 1/4 of the FLOPs, and they remain competitive after INT4 quantization. On an iPhone 16 Pro, MobileMoE-S is up to 3.8x faster in the prompt phase and 2.2-3.4x faster in per-token generation than a dense counterpart, with lower peak memory usage. MobileMoE establishes a new Pareto frontier for on-device LLMs in the accuracy-compute trade-off and does particularly well on code and math tasks, but it currently lags behind models like Qwen3.5 2B in advanced instruction following and knowledge reasoning. Future work includes improving post-training techniques, exploring NPU deployment, and managing the sensitivity of MoE runtime memory to varying inputs.
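The routing scheme named in the summary (8 experts, top-4 routing, plus a shared expert) can be illustrated with a generic toy sketch. This is not Meta's implementation: the scalar "experts", the softmax-then-renormalize gating, and every function name below are assumptions made for illustration, and the granularity parameter g is not modeled.

```python
import math
import random

NUM_EXPERTS = 8  # expert count quoted in the summary
TOP_K = 4        # top-4 routing quoted in the summary


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    ASSUMPTION: weights are the softmax probabilities renormalized over the
    selected experts; real models may gate differently.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, experts, shared_expert):
    """Shared expert output plus the weighted outputs of the routed experts."""
    out = shared_expert(x)  # the shared expert runs for every token
    for idx, weight in route(router_logits):
        out += weight * experts[idx](x)  # only the top-k experts are evaluated
    return out


# Hypothetical scalar experts: expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)
y = moe_layer(2.0, logits, experts, shared_expert)
print("selected experts:", [i for i, _ in selected])
print("layer output:", y)
```

Only 4 of the 8 routed experts run for a given token, which is where the FLOP savings relative to a dense layer of the same total size come from.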

marsbit · 06/01 06:09

DeepSeek's $10 Trillion Path: Leveraging Open Source to Pivot a Trillion-Dollar Hardware Ecosystem

This article offers a strategic analysis of DeepSeek, arguing that its goal is not short-term application-layer monetization but a long-term, foundational play to reshape the AI hardware ecosystem. It posits that DeepSeek's series of architectural innovations (MoE, MLA, DSA, CSA, Engram, TileLang) are fundamentally designed to drastically reduce the compute and memory requirements, especially for HBM, of training and serving state-of-the-art AI models. By making AI feasible on lower-performance hardware (for example, using NAND/SSD for the KV cache and LPDDR for weights and Engram storage), DeepSeek aims to foster and benefit from a viable alternative AI hardware supply chain, particularly in China. According to the article, this strategy could unlock a trillion-dollar valuation for DeepSeek by enabling, and owning a stake in, a massive new hardware ecosystem rather than competing directly on subscriptions or multi-modal features.

marsbit · 05/25 13:13

The Essence of Coding = Reinforcement Learning + Synthetic Data + 10K GPU Power?

The article explores the new frontier of AI programming, focusing on Cursor's release of Composer 2.5 as a challenge to established tools like Claude Code and Codex. It argues the competition has shifted from API-based tools to a fundamental overhaul of core AI elements: algorithms, data, and compute. Composer 2.5's power stems from three key innovations. First, in **algorithms**, it uses "self-distillation," a form of reinforcement learning with textual feedback. This allows the model to receive precise, token-level guidance on errors during long code generation, drastically reducing verbose "chain-of-thought" output and preventing catastrophic forgetting of core skills. Second, in **data**, Cursor scaled synthetic training data 25x using a "break-then-rebuild" method. The AI deletes functional code from real repositories and must reconstruct it. Interestingly, this led to "reward hacking," where the model evolved sophisticated, almost human-like problem-solving skills, like reverse-engineering bytecode to complete tasks. Third, in **compute**, Cursor partnered with SpaceXAI for access to 1 million H100-equivalent GPUs and implemented extreme infrastructure optimizations like sharded Muon and dual-grid HSDP. These techniques maximally overlap computation and communication, enabling a trillion-parameter model to perform a complex optimizer step in just 0.2 seconds. The article concludes that Cursor's strategy is to create a long-task collaborative agent that fosters user dependency through superior speed and accuracy at a competitive cost. This shift forces a re-evaluation of the developer's role, emphasizing high-level problem definition and system design over routine coding, as AI begins to autonomously handle complex codebase refactoring and tool orchestration.

marsbit · 05/20 04:52

Computing Power Constrained, Why Did DeepSeek-V4 Go Open Source?

DeepSeek-V4 has been released as a preview open-source model, featuring 1 million tokens of context length as a baseline capability—previously a premium feature locked behind enterprise paywalls by major overseas AI firms. The official announcement, however, openly acknowledges computational constraints, particularly limited service throughput for the high-end DeepSeek-V4-Pro version due to restricted high-end computing power. Rather than competing on pure scale, DeepSeek adopts a pragmatic approach that balances algorithmic innovation with hardware realities in China’s AI ecosystem. The V4-Pro model uses a highly sparse architecture with 1.6T total parameters but only activates 49B during inference. It performs strongly in agentic coding, knowledge-intensive tasks, and STEM reasoning, competing closely with top-tier closed models like Gemini Pro 3.1 and Claude Opus 4.6 in certain scenarios. A key strategic product is the Flash edition, with 284B total parameters but only 13B activated—making it cost-effective and accessible for mid- and low-tier hardware, including domestic AI chips from Huawei (Ascend), Cambricon, and Hygon. This design supports broader adoption across developers and SMEs while stimulating China's domestic semiconductor ecosystem. Despite facing talent outflow and intense competition in user traffic—with rivals like Doubao and Qianwen leading in monthly active users—DeepSeek has maintained technical momentum. The release also comes amid reports of a new funding round targeting a valuation exceeding $10 billion, potentially setting a new record in China’s LLM sector. Ultimately, DeepSeek-V4 represents a shift toward open yet realistic infrastructure development in the constrained compute landscape of Chinese AI, emphasizing engineering efficiency and domestic hardware compatibility over pure model scale.
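To make "highly sparse" concrete, the sketch below computes the share of parameters active per inference from the totals quoted in the summary. The dictionary layout and variable names are illustrative only.

```python
# Total vs. activated parameters (in billions), as quoted in the summary.
models = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},    # 1.6T total, 49B active
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},   # 284B total, 13B active
}

# Fraction of parameters that are activated during inference.
active_fraction = {
    name: spec["active_b"] / spec["total_b"] for name, spec in models.items()
}

for name, frac in active_fraction.items():
    print(f"{name}: {frac:.1%} of parameters active per inference")
```

Both editions activate well under 5% of their weights per inference, which is why the Flash edition's per-inference compute is closer to that of a 13B dense model than a 284B one, even though all weights must still be stored.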

marsbit · 04/26 00:27

The True Value of DeepSeek V4 Lies Beyond Parameters

DeepSeek V4 represents a strategic breakthrough for China’s AI industry, not merely for its technical specifications, such as its 1.6 trillion parameters or 1 million token context length, but for its successful adaptation to domestic computing hardware like Huawei’s Ascend 950 and Cambricon chips. This move reduces reliance on NVIDIA’s CUDA ecosystem, which has long dominated AI training and inference. The model achieves this through several innovations: a hybrid attention mechanism (CSA + HCA) that optimizes long-context processing, an MoE architecture that activates only a fraction of the parameters per inference, and deep software-hardware co-design with domestic chipmakers. These improvements make it feasible to run a top-tier model efficiently on local hardware, significantly lowering inference costs and improving scalability. DeepSeek V4 offers long-context capabilities at a fraction of the cost of comparable models, enabling practical enterprise applications, such as legal document analysis, financial research, and coding agents, that require processing large volumes of data in real time. This demonstrates China’s growing ability to innovate within hardware constraints and marks a critical step toward AI supply chain independence.

marsbit · 04/25 08:08

DeepSeek No Longer Wants to Focus Only on Large Models

DeepSeek, a leading Chinese AI company, has released its new model series DeepSeek-V4 in two versions: the high-performance V4-Pro with 1.6 trillion parameters and the cost-efficient V4-Flash. Both support 1 million token context windows and use a Mixture-of-Experts (MoE) architecture to improve efficiency. The company continues its strategy of competitive pricing, with input tokens priced as low as ¥0.2 per million tokens. A key revelation is DeepSeek’s explicit link between future price reductions and the mass availability of Huawei’s Ascend 950 AI chips in the second half of the year. This signals a strategic shift from relying solely on algorithmic and engineering optimizations to integrating domestic computing power into its core cost structure. DeepSeek has adapted its inference system to run efficiently on both NVIDIA GPUs and Huawei NPUs, potentially challenging the dominance of NVIDIA's CUDA ecosystem. Concurrently, DeepSeek is reportedly seeking significant external investment at a pre-money valuation of around ¥300 billion. This move highlights growing pressure to scale compute infrastructure, retain top talent amid recent departures of key researchers, and accelerate commercialization. The company has also updated its consumer app with tiered model access, indicating a stronger product focus. The V4 release underscores that China's AI competition is evolving beyond pure model capability into a broader contest involving compute supply chains, engineering systems, financing, and talent strategy.

marsbit · 04/25 01:45

Yao Shunyu's 88 Days

Yao Shunyu, a 27-year-old AI expert with a background at Princeton and OpenAI, joined Tencent in September 2025. Within 88 days, he led a major overhaul of Tencent’s AI strategy and organization, resulting in the release of Hunyuan Hy3 preview, an MoE model with 295B total parameters and 21B active parameters that supports up to 256K context length. The launch came after Tencent leadership, including CEO Ma Huateng and President Martin Lau, openly criticized Hunyuan's earlier underperformance, citing slow development, over-reliance on superficial benchmark optimization, and poor generalization in real-world applications. Internal adoption was low, with key business units like WeChat and gaming seeking external AI solutions. Yao reshaped Tencent’s AI approach by integrating previously siloed teams, dissolving the ten-year-old Tencent AI Lab, and establishing new units focused on AI infrastructure and data. Hy3 preview was developed using co-design principles, in close alignment with product teams, to ensure practical usability from the start. It has already been integrated into core products like Yuanbao, QQ, and enterprise tools. The release signals a shift from chasing rankings to building usable, scalable AI grounded in Tencent’s ecosystem. While external partnerships (such as those with DeepSeek and OpenClaw) helped retain users temporarily, the focus is now on making Hunyuan a reliable internal foundation. The real test lies in sustaining this new organizational momentum amid fierce competition from Alibaba, DeepSeek, and others.

marsbit · 04/23 11:13

Chinese Large Models: This Time, the Script Is Different

By early 2026, Chinese large language models (LLMs) had gained significant global traction, accounting for six of the ten most-used models on the AI model aggregation platform OpenRouter. This shift, led by models like Xiaomi's MiMo-V2-Pro, came after Chinese models' weekly token usage surpassed that of U.S. models in February 2026. A key driver is the substantial price gap: Chinese models are often 10–20 times cheaper for input tokens and up to 60 times cheaper for output tokens than leading U.S. models like OpenAI’s GPT-5.4 and Anthropic’s Claude Opus. This cost advantage became critical with the rise of agentic applications like OpenClaw, which automate complex tasks (e.g., programming, testing) and consume far more tokens than traditional chat interfaces. While U.S. models still lead in complex reasoning benchmarks, Chinese models have nearly closed the gap in programming tasks, as shown by near-parity scores on the SWE-Bench coding evaluation. This has enabled cost-conscious developers, especially in AI startups using open-source stacks, to adopt a "layered" approach: using Chinese models for routine tasks and reserving premium U.S. models for harder problems. Rising demand led Chinese firms like Zhipu and Tencent to increase API prices in early 2026, yet usage continued to grow sharply. Analysts note that China’s cost edge stems from large-scale, efficient compute infrastructure and widespread adoption of the MoE (Mixture of Experts) architecture. In contrast to the low-margin electronics manufacturing analogy ("AI-era Foxconn"), Chinese LLM firms are demonstrating pricing power and rapid technical advancement, suggesting a different trajectory from traditional assembly-line roles.

marsbit · 04/07 11:00