Running MoE on Mobile Phones? Meta Proposes MobileMoE, Speeding Up Inference on iPhone 16 Pro by up to 3.8x

marsbit | Published 2026-06-01 | Last updated 2026-06-01

Abstract

Meta's MobileMoE, a mobile-optimized Mixture-of-Experts (MoE) language model architecture, enables efficient on-device large language model (LLM) inference for the first time on commercial smartphones. Designed for decoder-only Transformers, it replaces dense feed-forward layers with MoE layers. Key design choices include 8 experts with granularity g=8, top-4 routing, and a shared expert. The model undergoes a four-stage training process: pre-training, intermediate training, supervised fine-tuning, and quantization-aware training. Results show MobileMoE models, with similar memory footprint, achieve equal or higher average accuracy across 14 foundational benchmarks while using only 1/2 to 1/4 of the FLOPs compared to dense baselines. After INT4 quantization, they remain competitive. Notably, on an iPhone 16 Pro, MobileMoE-S demonstrates significant speedups: up to 3.8x faster in the prompt phase and 2.2-3.4x faster in per-token generation compared to a dense counterpart, with lower peak memory usage. While MobileMoE establishes a new Pareto frontier for on-device LLMs in accuracy-compute trade-offs, particularly excelling in code and math tasks, it currently lags behind models like Qwen3.5 2B in advanced instruction following and knowledge reasoning. Future work includes improving post-training techniques, exploring NPU deployment, and managing the runtime memory sensitivity of MoE models to varying inputs.

In recent years, Mixture-of-Experts (MoE) architectures have been widely adopted in large cloud-based models. On mobile devices, however, large language models (LLMs) are still predominantly dense. Mobile devices have historically imposed stricter constraints on memory, compute, and latency, and on-device MoE in the sub-billion active-parameter range has lacked systematic study. As mobile DRAM capacity grows, deploying MoE on smartphones is becoming feasible.

The MobileMoE proposed by Meta's team achieves efficient MoE inference on commercial smartphones for the first time. Across 14 foundational benchmarks, MobileMoE-S/M match or exceed the average accuracy of dense baselines with similar memory usage while using only 1/2 to 1/4 of the inference compute. In on-device tests, MobileMoE-S showed its largest speedup on the iPhone 16 Pro's GPU/MLX backend, reaching up to 3.8x during the prompt-processing (input) phase.

Paper Link: https://arxiv.org/abs/2605.27358

The research team also proposed a set of on-device MoE scaling laws to determine model structures more suitable for mobile deployment. MobileMoE establishes a new Pareto frontier for on-device large language models, achieving better results in the trade-off between accuracy and inference computational overhead.

Figure | MobileMoE establishes a new Pareto frontier for on-device large language models.

How is MobileMoE Designed?

MobileMoE is a family of MoE language models designed for on-device deployment. The overall architecture remains a decoder-only Transformer, but the dense feed-forward layers are replaced with MoE layers. For each token, the router selects the top-scoring experts to take part in the computation, while a shared expert always participates. Training is divided into four stages: pretraining, intermediate training, supervised fine-tuning, and quantization-aware training.
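To make the routing description concrete, here is a minimal pure-Python sketch of a single MoE layer applied to one token. It is illustrative only and is not the paper's implementation: the toy linear "experts", the softmax router, the renormalization of gate weights over the selected experts, and all names (`moe_layer`, `make_expert`, `DIM`) are assumptions. Only the counts (60 routed experts, top-4 routing, one shared expert) come from the article.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Toy expert: a single dim x dim linear map (real experts are small FFNs)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    return expert


def moe_layer(x, router_w, routed_experts, shared_expert, top_k):
    """Route one token vector to its top_k experts and add the shared expert."""
    # Router: one score per routed expert.
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in router_w]
    probs = softmax(logits)
    # Keep only the top_k highest-scoring experts for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Assumed convention: renormalize gate weights over the chosen experts.
    norm = sum(probs[i] for i in chosen)
    out = shared_expert(x)  # the shared expert runs for every token
    for i in chosen:
        gate = probs[i] / norm
        y = routed_experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


rng = random.Random(0)
DIM, NUM_ROUTED, TOP_K = 8, 60, 4  # DIM is a toy value; the counts follow the article
router_w = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_ROUTED)]
routed = [make_expert(DIM, rng) for _ in range(NUM_ROUTED)]
shared = make_expert(DIM, rng)
token = [rng.uniform(-1, 1) for _ in range(DIM)]

output, chosen = moe_layer(token, router_w, routed, shared, TOP_K)
print(len(output), sorted(chosen))
```

Only the four selected experts and the shared expert are evaluated for the token; the other 56 routed experts contribute no compute, which is the source of the FLOP savings relative to a dense layer of the same total size.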

Pretraining: The research team pretrained the model using approximately 6T tokens of open-licensed data with a context length of 2048. The data primarily consisted of web content, while also covering domains such as mathematics, code, knowledge, and science.

Intermediate Training: The research team extended the context length to 8192 and further increased the proportion of high-quality data in areas like knowledge, code, mathematics, and science, with a total scale of about 500B tokens.

Supervised Fine-Tuning (SFT): The research team fine-tuned MobileMoE-Base on open-licensed instruction-tuning data comprising over 80 million samples.

Quantization-Aware Training: The research team quantized linear layers and embeddings to INT4 and activations to INT8 using dynamic quantization, while keeping the router in FP32 precision.
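The quantize-then-dequantize step that quantization-aware training typically simulates in the forward pass can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's recipe: it uses symmetric per-tensor scaling, and the function name `fake_quantize` and the sample values are hypothetical. The article does not specify the scaling scheme or group size.

```python
def fake_quantize(values, bits):
    """Symmetric quantize-dequantize of floats onto a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4, 127 for INT8
    qmin = -(2 ** (bits - 1))    # -8 for INT4, -128 for INT8
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values), 1.0
    scale = max_abs / qmax
    # Round to the nearest integer level, then clamp to the representable range.
    quantized = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


weights = [0.42, -0.13, 0.07, -0.35]   # hypothetical weight values
activations = [1.5, -0.2, 0.9, 3.1]    # hypothetical activation values

# Weights: INT4, with a scale fixed once training is done.
w_q, w_scale = fake_quantize(weights, bits=4)
# Activations: INT8, with the scale recomputed from each input ("dynamic").
a_q, a_scale = fake_quantize(activations, bits=8)
# Router weights would simply skip this step and stay in floating point.

w_err = max(abs(a - b) for a, b in zip(weights, w_q))
a_err = max(abs(a - b) for a, b in zip(activations, a_q))
print(w_err <= w_scale / 2 + 1e-12, a_err <= a_scale / 2 + 1e-12)
```

With symmetric scaling, the rounding error of each value is bounded by half a quantization step, so the INT4 grid (15 usable levels here) is much coarser than the INT8 grid. Training with this step in the loop lets the model adapt to that coarseness.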

Figure | The four-stage training of MobileMoE.

Experimental Results

Ablation Study Results

The research team first compared three architectural variables: the number of experts E, expert granularity g, and the inclusion of a shared expert.

Figure | Scaling the number of experts E.

Under a fixed memory budget, when memory is above approximately 0.25GB, the loss of MoE begins to be lower than that of the corresponding dense model. Continuing to increase the number of experts E further reduces the loss, but the marginal gain significantly diminishes after E increases to 8. Experiments on expert granularity g indicate that finer-grained expert configurations are generally better, with g=8 achieving a good balance between effectiveness and training cost; when g increases from 8 to 16, the loss improvement is less than 0.01, but training time increases by about 50%. Under the same computational budget, the model loss further decreases after adding a shared expert.

Based on the ablation study results, the research team ultimately adopted the configuration with E=8, g=8, and a shared expert, i.e., 60 fine-grained routing experts, Top-4 routing, and 1 shared expert, and used this architecture for the three versions: MobileMoE-S, M, and L.
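As a rough illustration of what this configuration implies per token, the sketch below computes the fraction of expert feed-forward parameters that are active. It assumes, purely for illustration, that the shared expert has the same size as each routed expert; the article does not state this, and the helper name `active_expert_fraction` is hypothetical.

```python
from fractions import Fraction

NUM_ROUTED = 60  # fine-grained routed experts (from the article)
TOP_K = 4        # routed experts selected per token (from the article)
NUM_SHARED = 1   # always-active shared expert (from the article)


def active_expert_fraction(num_routed, top_k, num_shared):
    """Fraction of expert FFN parameters used per token, assuming equal-size experts."""
    active = top_k + num_shared
    total = num_routed + num_shared
    return Fraction(active, total)


frac = active_expert_fraction(NUM_ROUTED, TOP_K, NUM_SHARED)
print(frac, f"{float(frac):.1%}")  # 5/61, about 8.2%
```

This covers only the expert feed-forward parameters. Attention and embedding parameters are used for every token, so the whole-model ratio of active to total parameters differs from this figure.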

Figure | Scaling MoE models under compute-optimal conditions.

Figure | Training efficiency of the MoE architecture.

14 Foundational Evaluations: Establishing a New On-Device Pareto Frontier

The research team compared MobileMoE with models such as Gemma 3, SmolLM2, Qwen3.5, OLMo 2, and OLMoE-1B-7B under a unified setup across 14 foundational evaluations in five categories: commonsense reasoning, knowledge, science, reading, and reasoning.

Figure | Pretraining trajectory of MobileMoE.

Comparison results for Base models show that the average score of MobileMoE-M is higher than that of Qwen3.5 2B, and the average score of MobileMoE-L is higher than that of OLMoE-1B-7B, while also requiring a smaller model size. The team also noted that the average score of the Base version of MobileMoE-L is already higher than that of the Instruct version of OLMoE-1B-7B. In terms of training scale, MobileMoE uses about 6T pretraining tokens, which is less than Llama 3.2 1B's 9T and SmolLM2 1.7B's 11T. In the overall comparison of instruction-tuned models, the average accuracy of MobileMoE-M is already close to that of OLMoE-1B-7B, but with both active and total parameters reduced by about 60%.

Figure | Comparison of MobileMoE-Base models.

Advanced Evaluations: More Prominent Advantages in Code and Math Tasks

In advanced evaluations after instruction tuning, MobileMoE stands out most on code and math tasks. MobileMoE-L, for example, scores higher on average in both the code and math categories than Qwen3.5 2B and OLMoE-1B-7B. However, the research team also notes that Qwen3.5 2B remains stronger in instruction following and knowledge reasoning.

Figure | Comparison of Instruct models on advanced benchmarks.

Quantization and On-Device Deployment: Remains Competitive After INT4, Significant Speedup on Mobile

After quantization, the overall average scores of MobileMoE-S/M/L decreased compared to their respective BF16 versions, but the drops were roughly within 2 to 3 points. Even so, the performance of the INT4 version of MobileMoE-L remained higher than the BF16 version of OLMoE-1B-7B Instruct.

The research team also deployed MobileMoE on a Samsung Galaxy S25 and an iPhone 16 Pro. With comparable INT4 weight memory, MobileMoE-S was 1.8-3.8x faster than MobileLLM-Pro during the prompt-processing (input) phase and 2.2-3.4x faster during token-by-token generation.

In terms of memory usage, on the Samsung Galaxy S25 with an 8K context length and real prompts, the peak RSS of MobileMoE-S was 1.49GB, lower than MobileLLM-Pro's 1.91GB.
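For scale, the relative saving implied by these two peak-RSS figures can be worked out directly. The helper name `relative_reduction` is hypothetical; the two inputs are the values reported above.

```python
def relative_reduction(baseline, candidate):
    """Fractional reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


DENSE_PEAK_GB = 1.91  # MobileLLM-Pro peak RSS reported in the article
MOE_PEAK_GB = 1.49    # MobileMoE-S peak RSS reported in the article

reduction = relative_reduction(DENSE_PEAK_GB, MOE_PEAK_GB)
print(f"{reduction:.1%}")  # 22.0%
```

That is roughly a 22% lower peak for this one setting (Galaxy S25, 8K context, real prompts); it should not be read as a general figure for other devices or inputs.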

Figure | On-device runtime latency.

Limitations and Future Directions

Currently, the instruction-tuned MobileMoE still lags behind Qwen3.5 2B in higher-order instruction following, knowledge, and reasoning. The research team believes this gap may stem from Qwen3.5 2B's more comprehensive post-training. To narrow it, future training work should focus on strengthening distillation, inference-oriented post-training, and multimodal extension.

Furthermore, the research team points out that the memory footprint of MoE on mobile phones varies with input content. Compared to templated inputs, real inputs typically lead to higher memory usage. Testing solely based on templated inputs might underestimate the actual memory pressure in real deployment scenarios. In the future, to more accurately evaluate the real memory performance of on-device MoE, more real-world measurement data is needed.

The research team has completed systematic real-device testing on CPU and GPU backends, but the NPU path remains to be explored. Dynamic routing, expert pruning, mixed-precision quantization, and mobile NPU deployment are all directions for further improving on-device efficiency.

For more technical details, please refer to the original paper.

This article is from the WeChat public account "Academic Headlines" (ID: SciTouTiao), Author: Xia Qiansi

Related Questions

Q: What is MobileMoE and what problem does it address?

A: MobileMoE is an efficient Mixture of Experts (MoE) language model designed by Meta for on-device deployment on smartphones. It addresses the challenge of deploying large language models (LLMs) with MoE architecture on mobile devices, which traditionally have stringent constraints on memory, compute power, and latency, making them predominantly use dense architectures. MobileMoE aims to provide competitive accuracy with significantly less computational cost than dense baselines.

Q: What key speed improvement did MobileMoE achieve on the iPhone 16 Pro?

A: In real-world testing, MobileMoE-S showed the most significant speed improvement on the iPhone 16 Pro's GPU/MLX backend, achieving up to a 3.8x speedup during the input processing phase.

Q: What are the main stages in the training process of MobileMoE?

A: The training process for MobileMoE consists of four main stages: 1) Pre-training on ~6T tokens with a 2048 context length, 2) Mid-training which extends the context to 8192 and uses ~500B higher-quality tokens, 3) Supervised Fine-Tuning (SFT) on over 80 million open-licensed instruction-following samples, and 4) Quantization-Aware Training (QAT), quantizing most layers to INT4/INT8 while keeping the router at FP32.

Q: What are the primary limitations or future directions mentioned for MobileMoE?

A: The main limitations are that MobileMoE still lags behind models like Qwen3.5 2B in higher-order instruction following and knowledge/reasoning capabilities. Future directions to address this include enhanced distillation, inference-oriented post-training, and multimodal extension. Additionally, MoE's runtime memory usage is sensitive to input content, and deploying on mobile NPUs remains an unexplored avenue for further efficiency gains.

Q: How does MobileMoE's performance on code and mathematical tasks compare to other models in advanced evaluations?

A: In advanced evaluations after instruction fine-tuning, MobileMoE showed more pronounced advantages on code and mathematical tasks. For example, the MobileMoE-L model achieved higher average scores on code and math benchmarks compared to both Qwen3.5 2B and OLMoE-1B-7B.
