DeepSeek's New Technology Ported to Apple Silicon, Mac Local LLM Accelerated by 60%

marsbit發佈於 2026-07-03更新於 2026-07-03

文章摘要

DeepSeek's newly open-sourced DSpark inference acceleration technology has been ported to Apple Silicon, yielding significant speedups for running large language models locally on Macs. The port, called mlx-dspark, was developed by engineer Abdur Rahim and supports models like Gemma-4 12B and Qwen3-4B. DSpark uses speculative decoding, where a smaller "draft" model proposes candidate tokens which are then verified in a batch by the target model. Rahim adapted this approach for Apple's MLX framework, implementing 4-bit quantization for the draft model. On an M4 Pro Mac, this resulted in generation speeds increasing by approximately 1.6x for Gemma-4 12B (to ~30 tok/s) and 1.4x for Qwen3-4B (to ~73 tok/s). Crucially, the port maintains bitwise identical output to the original models, including support for temperature sampling, not just greedy decoding. The project also integrated DFlash, an alternative block-based speculative decoding method from z-lab. Benchmarks show DFlash excels in predictable contexts like code/math tasks (achieving ~2.1x speedup), while DSpark's Markov head provides better performance for open-ended chat. The latest mlx-dspark version allows users to switch between these methods. The work demonstrates efficient, high-fidelity local LLM inference on consumer Apple hardware.

Kressey from the Aofeisi Quantum Bit | Official Account QbitAI

Just one week after DeepSeek open-sourced DSpark, it's been ported to Apple computers.

The ported version is called mlx-dspark, running the Gemma-4 12B and Qwen3-4B models.

After installation, the generation speed of these two models on Mac increased by 1.6x and 1.4x respectively.

More importantly, it achieved something most ported versions can't — the output is byte-for-byte identical to the original model, not a single character off.

In other words, speed is gained without sacrificing any quality.

The person behind this is Abdur Rahim, an engineer who tinkers with open-source projects in his spare time. He single-handedly created the first native Mac version since DSpark was open-sourced.

Mac Running LLMs, Speed Boost of 60%

For DeepSeek's DSpark, open-sourced on June 27th, the official figures show a speed improvement of 60% to 85% in server-side scenarios.

However, this technology initially only had implementations for data center GPUs, with no version adapted for Apple Silicon.

mlx-dspark is the first native Apple Silicon version of this technology.

The idea behind DSpark is to pair a smaller model to assist the target model. The small model first generates several candidate tokens in one go, then the target model verifies them all at once, accepting the correct ones and rejecting the wrong ones for re-guessing.

The cost of this step differs between data centers and Apple computers.

On data center GPUs, verifying a batch of candidate tokens is more like chartering a bus—the fare is fixed regardless of the number of passengers. Since decoding is already memory-bound, verifying a few more tokens hardly adds any time.

Apple Silicon is more like a metered taxi—the more candidate tokens verified, the higher the meter runs.

Rahim tested it practically. For Gemma-4 12B, each additional token verified costs about 14 milliseconds. He calculated this into a cost model, concluding that the speed ceiling on Apple Silicon is around 2.2x.

In short, Rahim ported this assisting small model from HuggingFace's checkpoint and paired it with the target models Gemma-4 12B and Qwen3-4B.

He also rebuilt the verification process within the MLX framework and quantized the weights to 4-bit.

As a result, on the M4 Pro, compared to Apple's official MLX tool, Gemma-4 12B's generation speed increased from 18.4 tok/s to about 30 tok/s, about 1.6x the original; Qwen3-4B increased from 52.9 tok/s to about 73 tok/s, about 1.4x the original.

Additionally, in mlx-dspark, Rahim did something most porting work doesn't.

Ported Version, High-Fidelity Reproduction Possible

Most versions that port large models locally only support greedy decoding, meaning they pick the highest probability token at each step.

In mlx-dspark, Rahim implemented the temperature sampling method originally described in the DSpark paper. The draft model provides candidate tokens, and the acceptance probability is min(1, p/q), with unaccepted parts resampled from the residual.

He personally verified that the output from this process strictly equals the exact distribution the target model would give at the same temperature, not a discounted approximation.

Most speculative decoding implementations only do the greedy version because verifying the correctness of greedy mode is simple—just compare byte-by-byte.

The extra step Rahim took was personally checking the output distribution generated in sampling mode to confirm it wasn't distorted.

What precision the target model responsible for verification should be was a pitfall he figured out through trial.

If the small model was paired with a base target model without instruction fine-tuning, only 47% of the candidate tokens passed verification; switching to the corresponding instruction-tuned version increased this ratio to 82%.

He also tested switching the target model to bf16 precision. The increase in verification cost outweighed the increase in acceptance rate, making it slower, so leaving the target model at 8-bit by default is most cost-effective.

The small model responsible for generating candidate tokens uses a different precision.

The draft model itself was compressed by him. After 4-bit quantization, it's only 1.8GB, easily fitting into memory, and runs without loss.

The result is that DSpark not only achieved acceleration but also successfully reproduced the 16% to 18% acceptance rate improvement mentioned in the paper on the device.

DFlash Also Integrated, Faster on Code Tasks

After the tweet was posted, a comment appeared in the replies. Jian Chen, one of the authors of the DFlash paper, asked if they could try his team's model.

DFlash is another speculative decoding scheme proposed in a paper published by z-lab in May. The team lead author is Zhijian Liu, an assistant professor at UCSD and simultaneously a research scientist at NVIDIA.

DFlash's approach is different from DSpark. It uses a single parallel "block diffusion" to denoise an entire block of 16 tokens, rather than guessing step-by-step with dependencies like DSpark.

Rahim got to work quickly.

Using a porting script written by Jian himself, he connected the z-lab released gemma4-12B-it-DFlash to the Gemma-4 target model in mlx-vlm. On the same Mac, he ran another head-to-head comparison against the DSpark he just tested.

On code and math tasks, DFlash's block decoding acceptance length reached 5.95 to 6.20, speed about 36 tok/s, achieving about 2.1x, beating DSpark.

However, DFlash generates an entire block of 16 tokens at once, but the target model may not accept all of them. The portion that actually passes verification is only a part, referred to in the industry as the "acceptance length"—it's not always possible to fill all 16.

Therefore, in scenarios like open chat where content is unpredictable, the acceptance length doesn't increase, the block isn't fully utilized, and DFlash's advantage doesn't show.

DSpark's Markov head exists precisely to address this same issue. Parallel generation of an entire block of tokens means positions further back are calculated independently, making them prone to misalignment. The Markov head adds a layer of dependency between these positions specifically to correct this.

The result is, in chat scenarios, DSpark is actually faster than DFlash.

The subsequently updated mlx-dspark v0.0.3 officially integrated the z-lab original DFlash into the package, adding a parameter to manually shorten DFlash's effective block length—use short blocks for chat scenarios, and still use the full 16-token block for code and math scenarios.

After this, the same Mac, the same package, can handle both chat and code/math tasks, no longer needing to switch between the DSpark and DFlash projects.

Rahim said in his tweet that the same method should also work on larger Qwen3-8B and 14B draft models.

Reference Links:[1]https://x.com/_ARahim_/status/2072021710602432577[2]https://github.com/ARahim3/mlx-dspark

This article is from the official WeChat account "QbitAI", author: Focus on Frontier Technology

熱門幣種推薦

相關問答

QWhat is the main achievement of the mlx-dspark project described in the article?

Amlx-dspark is the first native Apple Silicon port of DeepSeek's DSpark technology. It significantly speeds up the inference of models like Gemma-4 12B and Qwen3-4B on Macs (by ~1.6x and ~1.4x respectively) while maintaining output quality identical to the original models, byte-for-byte.

QHow does DSpark's speculative decoding method work to accelerate inference?

ADSpark uses a smaller 'draft' model to rapidly generate multiple candidate tokens (speculative decoding). The larger, target model then verifies these candidates in a batch. Correct tokens are accepted, and incorrect ones are rejected and regenerated. This reduces the number of times the slower target model needs to run.

QWhat key difference did Abdur Rahim implement in mlx-dspark that many other ports do not?

AUnlike many ports that only support greedy decoding, Rahim implemented the full temperature sampling method as described in the DSpark paper. This ensures the output distribution is mathematically identical to the target model's, not just an approximation, preserving generation quality.

QHow does the performance of DFlash compare to DSpark on Mac, according to the tests in the article?

ADFlash, which decodes in parallel blocks of 16 tokens, outperforms DSpark on code and math tasks, achieving speeds around 36 tok/s (~2.1x speedup). However, in open-ended chat scenarios where content is less predictable, DSpark's Markov head (which adds dependencies between candidate tokens) performs better than DFlash's block decoding.

QWhat practical feature was added in mlx-dspark v0.0.3 to handle different types of tasks?

AVersion 0.0.3 integrated the original DFlash model and added a parameter to manually adjust the effective block length. Users can use shorter blocks for chat scenarios and full 16-token blocks for code/math tasks, allowing a single package to handle different task types efficiently.

你可能也喜歡

比特币价格达62K美元:为何CoinShares警告‘这看起来仍处于筑底早期阶段’

在经历长期下跌后,比特币价格开始企稳,但宏观因素仍阻碍其长期反弹。CoinShares报告指出,反弹由弱于预期的美国6月非农就业数据引发,该数据仅增加5.7万个工作岗位,远低于预期。同时失业率微降至4.2%,导致市场推迟对美联储加息的预期,两年期国债收益率下降,促使部分资金流向比特币等风险资产,助其从约5.7万美元低点回升。 然而报告警告,不应将此视为美联储政策的根本转变。美联储6月会议维持利率不变,点阵图反而更趋鹰派,政策制定者现预计2026年底利率平均为3.8%,高于三个月前的预测。 另一方面,持有超10万枚比特币的“鲸鱼”在2025年市场高点附近抛售约390亿美元,构成当年主要价格压力,但此类抛售在2026年已基本停止。尽管比特币交易所交易产品(ETP)今年出现约27亿美元净流出,但CoinShares认为这并非信心流失,资金主要流向了人工智能主题ETF。 报告还指出,伊朗冲突等地缘政治不确定性、CLARITY法案年内通过希望减弱,以及Strategy的比特币持仓可能带来的供应压力,均是当前挑战。因此,当前市场仍处于筑底初期,而非新一轮明确上涨的开端。 比特币现价约62494.63美元,日内上涨1.3%。未平仓合约自6月中旬以来稳步上升,表明交易者仍在建仓。但低价格与高未平仓合约并存意味着多空双方杠杆均在增加,可能加剧未来价格波动。

ambcrypto3 小時前

比特币价格达62K美元:为何CoinShares警告‘这看起来仍处于筑底早期阶段’

ambcrypto3 小時前

交易

現貨

熱門文章

如何購買ONE

歡迎來到HTX.com!在這裡,購買Harmony (ONE)變得簡單而便捷。跟隨我們的逐步指南,放心開始您的加密貨幣之旅。第一步:創建您的HTX帳戶使用您的 Email、手機號碼在HTX註冊一個免費帳戶。體驗無憂的註冊過程並解鎖所有平台功能。立即註冊第二步:前往買幣頁面,選擇您的支付方式信用卡/金融卡購買:使用您的Visa或Mastercard即時購買Harmony (ONE)。餘額購買:使用您HTX帳戶餘額中的資金進行無縫交易。第三方購買:探索諸如Google Pay或Apple Pay等流行支付方式以增加便利性。C2C購買:在HTX平台上直接與其他用戶交易。HTX 場外交易 (OTC) 購買:為大量交易者提供個性化服務和競爭性匯率。第三步:存儲您的Harmony (ONE)購買Harmony (ONE)後,將其存儲在您的HTX帳戶中。您也可以透過區塊鏈轉帳將其發送到其他地址或者用於交易其他加密貨幣。第四步:交易Harmony (ONE)在HTX的現貨市場輕鬆交易Harmony (ONE)。前往您的帳戶,選擇交易對,執行交易,並即時監控。HTX為初學者和經驗豐富的交易者提供了友好的用戶體驗。

685 人學過發佈於 2024.12.12更新於 2026.06.02

如何購買ONE

相關討論

歡迎來到 HTX 社群。在這裡,您可以了解最新的平台發展動態並獲得專業的市場意見。 以下是用戶對 ONE (ONE)幣價的意見。

活动图片