DeepSeek's New Technology Ported to Apple Silicon, Mac Local LLM Accelerated by 60%

marsbit · Published 2026-07-03 · Updated 2026-07-03

Article Summary

DeepSeek's newly open-sourced DSpark inference acceleration technology has been ported to Apple Silicon, yielding significant speedups for running large language models locally on Macs. The port, called mlx-dspark, was developed by engineer Abdur Rahim and supports models such as Gemma-4 12B and Qwen3-4B. DSpark uses speculative decoding, in which a smaller "draft" model proposes candidate tokens that the target model then verifies in a batch. Rahim adapted this approach to Apple's MLX framework and quantized the draft model to 4-bit. On an M4 Pro Mac, generation speed rose to approximately 1.6x for Gemma-4 12B (~30 tok/s) and 1.4x for Qwen3-4B (~73 tok/s). Crucially, the port's output matches the original models: greedy output is byte-for-byte identical, and temperature sampling follows the target model's exact distribution rather than an approximation. The project also integrated DFlash, an alternative block-based speculative decoding method from z-lab. Benchmarks show DFlash excels in predictable contexts such as code and math tasks (about 2.1x speedup), while DSpark's Markov head performs better for open-ended chat. The latest mlx-dspark version lets users switch between the two methods. The work demonstrates efficient, high-fidelity local LLM inference on consumer Apple hardware.

Kressey, reporting from Aofeisi | QbitAI (WeChat official account: QbitAI)

Just one week after DeepSeek open-sourced DSpark, it's been ported to Apple computers.

The port is called mlx-dspark, and it runs the Gemma-4 12B and Qwen3-4B models.

With it installed, the generation speed of these two models on a Mac rises to about 1.6x and 1.4x of the original, respectively.

More importantly, it achieves something most ports cannot: the output is byte-for-byte identical to the original model's, not a single character off.

In other words, speed is gained without sacrificing any quality.

The person behind this is Abdur Rahim, an engineer who tinkers with open-source projects in his spare time. He single-handedly created the first native Mac version since DSpark was open-sourced.

Mac Running LLMs, Speed Boost of 60%

For DeepSeek's DSpark, open-sourced on June 27th, the official figures show a speed improvement of 60% to 85% in server-side scenarios.

However, this technology initially only had implementations for data center GPUs, with no version adapted for Apple Silicon.

mlx-dspark is the first native Apple Silicon version of this technology.

The idea behind DSpark is to pair the target model with a smaller draft model. The draft model first generates several candidate tokens in one go, then the target model verifies them all at once, accepting the correct ones and rejecting the wrong ones, which are then guessed again.
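The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration for greedy decoding only, not mlx-dspark's code: `toy_target` and `toy_draft` are made-up stand-ins for real models, and the verification loop here checks positions one by one where a real implementation would score them in a single batched forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the greedy next token


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the target model generates one token per step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_greedy(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens per round and keep the prefix the target agrees with.

    Returns the generated sequence and the number of verification rounds.
    """
    out = list(prompt)
    limit = len(prompt) + n_new
    rounds = 0
    while len(out) < limit:
        rounds += 1
        # 1) The draft model guesses k tokens in a row.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2) The target checks the guesses in order.
        for guess in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own token
            if guess != expected or len(out) == limit:
                break  # first mismatch: the remaining guesses are discarded
    return out, rounds


# Toy stand-ins for real models (purely illustrative).
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + 7) % 50


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every third position.
    tok = toy_target(ctx)
    return tok if len(ctx) % 3 else (tok + 1) % 50


prompt = [1, 2, 3]
baseline = greedy_decode(toy_target, prompt, 12)
fast, rounds = speculative_greedy(toy_target, toy_draft, prompt, 12)
```

Because every emitted token is the one the target itself would have chosen, `fast` equals `baseline`; the gain is that `rounds` is smaller than the number of tokens generated.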

The cost of this verification step differs between data centers and Apple computers.

On data center GPUs, verifying a batch of candidate tokens is like chartering a bus: the fare is fixed regardless of the number of passengers. Decoding there is already memory-bound, so verifying a few more tokens adds hardly any time.

Apple Silicon is more like a metered taxi: the more candidate tokens verified, the higher the meter runs.

Rahim measured this in practice. For Gemma-4 12B, each additional token verified costs about 14 milliseconds. He built this figure into a cost model, which puts the speed ceiling on Apple Silicon at around 2.2x.
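To make the "metered taxi" effect concrete, here is a rough cost model in Python. It is a sketch under stated assumptions, not Rahim's actual model, and it does not attempt to reproduce his 2.2x figure. Only the 14 ms per verified token and the 18.4 tok/s baseline come from the article; the independent per-token acceptance rate, the draft cost, and all function names are assumptions made for illustration.

```python
def expected_tokens_per_round(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per round with k draft tokens.

    Assumes each draft token is accepted independently with probability
    accept_rate, and counts the one token the target always contributes.
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def speedup(k: int, accept_rate: float, base_ms: float,
            verify_ms_per_token: float, draft_ms_per_token: float = 0.0) -> float:
    """Estimated speedup over plain decoding for a draft length of k."""
    # Metered cost: every extra candidate adds verification (and draft) time.
    round_ms = base_ms + k * (verify_ms_per_token + draft_ms_per_token)
    tokens = expected_tokens_per_round(accept_rate, k)
    return tokens * base_ms / round_ms


def best_k(accept_rate: float, max_k: int = 16, **costs: float) -> int:
    """Draft length in [0, max_k] with the highest estimated speedup."""
    return max(range(max_k + 1), key=lambda k: speedup(k, accept_rate, **costs))


BASE_MS = 1000.0 / 18.4   # one plain decoding step at 18.4 tok/s
VERIFY_MS = 14.0          # measured cost per extra verified token

# Illustrative query; the 0.8 acceptance rate is an assumed input.
k_star = best_k(0.8, base_ms=BASE_MS, verify_ms_per_token=VERIFY_MS)
estimate = speedup(k_star, 0.8, BASE_MS, VERIFY_MS)
```

With a per-token cost of zero (the "chartered bus" case), the estimate keeps growing with longer drafts. With a nonzero per-token cost, longer drafts eventually stop paying off, which is why a ceiling appears on Apple Silicon.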

Concretely, Rahim ported the draft model from its Hugging Face checkpoint and paired it with the target models Gemma-4 12B and Qwen3-4B.

He also rebuilt the verification process within the MLX framework and quantized the weights to 4-bit.

As a result, on the M4 Pro, compared to Apple's official MLX tool, Gemma-4 12B's generation speed increased from 18.4 tok/s to about 30 tok/s, about 1.6x the original; Qwen3-4B increased from 52.9 tok/s to about 73 tok/s, about 1.4x the original.

Additionally, in mlx-dspark, Rahim did something most porting work doesn't.

Ported Version, High-Fidelity Reproduction Possible

Most local ports support only greedy decoding, meaning they pick the highest-probability token at each step.

In mlx-dspark, Rahim implemented the temperature sampling method originally described in the DSpark paper. The draft model provides candidate tokens, and each candidate is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to that token. A rejected candidate is replaced by a token resampled from the residual distribution.
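The accept-or-resample rule can be sketched as follows. This is the standard speculative sampling step written in plain Python for a single position, not code from mlx-dspark; the function names are hypothetical, `p` is the target model's distribution, and `q` is the draft model's distribution at the same temperature.

```python
import random
from typing import Sequence


def accept_or_resample(token: int, p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> int:
    """Decide the fate of one draft token that was sampled from q."""
    # Accept with probability min(1, p/q).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    # Rejected: resample from the residual max(0, p - q), renormalized.
    # A rejection implies p[token] < q[token], so the residual is non-empty.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def speculative_sample(p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> int:
    """Draw a draft token from q, then apply the acceptance rule."""
    token = rng.choices(range(len(q)), weights=q, k=1)[0]
    return accept_or_resample(token, p, q, rng)


# Illustrative distributions over a three-token vocabulary.
p = [0.5, 0.3, 0.2]   # target
q = [0.2, 0.2, 0.6]   # draft
rng = random.Random(0)
n = 50_000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
freqs = [c / n for c in counts]
```

Even though the draft distribution `q` is quite different from `p`, the empirical frequencies in `freqs` track `p`, which is the property that lets sampling mode stay faithful to the target model.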

He personally verified that the output from this process strictly equals the exact distribution the target model would give at the same temperature, not a discounted approximation.

Most speculative decoding implementations only do the greedy version because verifying the correctness of greedy mode is simple: just compare the outputs byte by byte.

The extra step Rahim took was personally checking the output distribution generated in sampling mode to confirm it wasn't distorted.
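The article does not describe how Rahim ran his check. One way to see why no distortion should appear is to compute the output distribution of the accept-or-resample rule exactly, for a single position, and compare it with the target distribution. The sketch below does that in plain Python; the function name and the example distributions are assumptions for illustration.

```python
from typing import List, Sequence


def output_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept-or-resample step.

    p is the target distribution, q the draft distribution.
    """
    # Probability that token x is drafted AND accepted: q(x) * min(1, p(x)/q(x)).
    accepted = [qi * min(1.0, pi / qi) if qi > 0 else 0.0
                for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, the token comes from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p and q are identical, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


p = [0.5, 0.3, 0.2]   # target
q = [0.2, 0.2, 0.6]   # draft
out = output_distribution(p, q)
max_error = max(abs(o - pi) for o, pi in zip(out, p))
```

The accepted mass for each token is min(p, q) and the rejected mass is redistributed as max(0, p - q), so the two parts add up to p for every token, and `max_error` is zero up to floating-point rounding.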

Which target model to use for verification, and at what precision, was a pitfall he worked out by trial and error.

If the small model was paired with a base target model without instruction fine-tuning, only 47% of the candidate tokens passed verification; switching to the corresponding instruction-tuned version increased this ratio to 82%.

He also tested switching the target model to bf16 precision. The increase in verification cost outweighed the increase in acceptance rate, making it slower, so leaving the target model at 8-bit by default is most cost-effective.

The small model responsible for generating candidate tokens uses a different precision.

Rahim quantized the draft model himself. At 4-bit it takes up only 1.8 GB, fits easily in memory, and runs without loss.

The result is that DSpark not only delivered the speedup but also reproduced, on-device, the 16% to 18% acceptance-rate improvement reported in the paper.

DFlash Also Integrated, Faster on Code Tasks

After the tweet was posted, a comment appeared in the replies. Jian Chen, one of the authors of the DFlash paper, asked if they could try his team's model.

DFlash is another speculative decoding scheme, proposed in a paper published by z-lab in May. The team's lead, Zhijian Liu, is an assistant professor at UCSD and also a research scientist at NVIDIA.

DFlash's approach differs from DSpark's. It uses a single parallel "block diffusion" pass to denoise an entire block of 16 tokens, rather than guessing step by step with dependencies as DSpark does.

Rahim got to work quickly.

Using a porting script written by Jian himself, he connected the gemma4-12B-it-DFlash model released by z-lab to the Gemma-4 target model in mlx-vlm. On the same Mac, he ran another head-to-head comparison against the DSpark setup he had just tested.

On code and math tasks, DFlash's block decoding reached an acceptance length of 5.95 to 6.20 and a speed of about 36 tok/s, a speedup of about 2.1x, beating DSpark.

However, although DFlash generates an entire block of 16 tokens at once, the target model may not accept all of them. Only part of the block actually passes verification, and the length of that part is known in the industry as the "acceptance length". It is not always possible to fill all 16.
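As a small illustration of the metric, the acceptance length of one round can be computed as the length of the matching prefix between the drafted block and the tokens the target model would have produced. This is a sketch with made-up token IDs and hypothetical function names; whether a given paper also counts the extra token supplied by the target is a convention not settled here.

```python
from typing import List, Sequence, Tuple


def acceptance_length(block: Sequence[int], verified: Sequence[int]) -> int:
    """Number of leading draft tokens that match the target's own tokens."""
    n = 0
    for drafted, expected in zip(block, verified):
        if drafted != expected:
            break  # everything after the first mismatch is discarded
        n += 1
    return n


def mean_acceptance_length(rounds: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average acceptance length over several (block, verified) rounds."""
    return sum(acceptance_length(b, v) for b, v in rounds) / len(rounds)


# A 16-token block in which the target agrees with the first 6 positions only.
block = list(range(100, 116))
verified = block[:6] + [0] * 10
one_round = acceptance_length(block, verified)

# Two rounds: 6 tokens accepted in the first, all 16 in the second.
average = mean_acceptance_length([(block, verified), (block, block)])
```

Here `one_round` is 6 even though 16 tokens were drafted, and `average` is 11.0 across the two rounds.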

Therefore, in scenarios like open chat where content is unpredictable, the acceptance length doesn't increase, the block isn't fully utilized, and DFlash's advantage doesn't show.

DSpark's Markov head exists precisely to address this same issue. Parallel generation of an entire block of tokens means positions further back are calculated independently, making them prone to misalignment. The Markov head adds a layer of dependency between these positions specifically to correct this.

The result is, in chat scenarios, DSpark is actually faster than DFlash.

The later mlx-dspark v0.0.3 release integrates z-lab's original DFlash into the package and adds a parameter for manually shortening DFlash's effective block length: short blocks for chat scenarios, and the full 16-token block for code and math scenarios.

With this, a single package on the same Mac can handle both chat and code/math tasks, with no need to switch between the DSpark and DFlash projects.

Rahim said in his tweet that the same method should also work on larger Qwen3-8B and 14B draft models.

Reference Links:
[1] https://x.com/_ARahim_/status/2072021710602432577
[2] https://github.com/ARahim3/mlx-dspark

This article is from the official WeChat account "QbitAI", author: Focus on Frontier Technology
