DeepSeek's New Technology Ported to Apple Silicon, Mac Local LLM Accelerated by 60%

marsbit | Published 2026-07-03 | Updated 2026-07-03

Introduction

DeepSeek's newly open-sourced DSpark inference acceleration technology has been ported to Apple Silicon, yielding significant speedups for running large language models locally on Macs. The port, called mlx-dspark, was developed by engineer Abdur Rahim and supports models like Gemma-4 12B and Qwen3-4B. DSpark uses speculative decoding, where a smaller "draft" model proposes candidate tokens which are then verified in a batch by the target model. Rahim adapted this approach for Apple's MLX framework, implementing 4-bit quantization for the draft model. On an M4 Pro Mac, this resulted in generation speeds increasing by approximately 1.6x for Gemma-4 12B (to ~30 tok/s) and 1.4x for Qwen3-4B (to ~73 tok/s). Crucially, the port maintains bitwise identical output to the original models, including support for temperature sampling, not just greedy decoding. The project also integrated DFlash, an alternative block-based speculative decoding method from z-lab. Benchmarks show DFlash excels in predictable contexts like code/math tasks (achieving ~2.1x speedup), while DSpark's Markov head provides better performance for open-ended chat. The latest mlx-dspark version allows users to switch between these methods. The work demonstrates efficient, high-fidelity local LLM inference on consumer Apple hardware.

Kressey, reporting from Aofeisi | QbitAI (WeChat official account: QbitAI)

Just one week after DeepSeek open-sourced DSpark, it's been ported to Apple computers.

The port is called mlx-dspark, and it runs the Gemma-4 12B and Qwen3-4B models.

With it installed, the generation speed of these two models on a Mac rose to about 1.6x and 1.4x of the baseline, respectively.

More importantly, it achieved something most ported versions can't — the output is byte-for-byte identical to the original model, not a single character off.

In other words, speed is gained without sacrificing any quality.

The person behind this is Abdur Rahim, an engineer who tinkers with open-source projects in his spare time. He single-handedly created the first native Mac version since DSpark was open-sourced.

Mac Running LLMs, Speed Boost of 60%

For DeepSeek's DSpark, open-sourced on June 27th, the official figures show a speed improvement of 60% to 85% in server-side scenarios.

However, this technology initially only had implementations for data center GPUs, with no version adapted for Apple Silicon.

mlx-dspark is the first native Apple Silicon version of this technology.

The idea behind DSpark is to pair a smaller model to assist the target model. The small model first generates several candidate tokens in one go, then the target model verifies them all at once, accepting the correct ones and rejecting the wrong ones for re-guessing.
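The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal greedy illustration, not mlx-dspark's code: `toy_target` and `toy_draft` are made-up stand-ins for real models, and the target is called once per position for readability, whereas a real implementation would score all candidates in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the greedy next token


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding with k drafted candidates per cycle."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The draft model proposes k candidate tokens, one after another.
        ctx = list(out)
        candidates = []
        for _ in range(k):
            tok = draft(ctx)
            candidates.append(tok)
            ctx.append(tok)
        # 2. The target checks the candidates in order (batched in practice).
        for cand in candidates:
            correct = target(out)
            out.append(correct)          # always emit the target's own choice
            if correct != cand or len(out) >= goal:
                break                    # a mismatch (or a full budget) ends the cycle
        else:
            # 3. All k candidates matched and budget remains:
            #    the verification pass also yields one extra token.
            out.append(target(out))
    return out[:goal]


# Hypothetical toy models over tokens 0..6 (illustration only).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if ctx[-1] == 5 else toy_target(ctx)
```

Because the emitted token is always the target's own choice, `speculative_greedy(toy_target, toy_draft, [1], 20)` returns the same sequence as `greedy_decode(toy_target, [1], 20)`; the draft only affects how many target passes are needed.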

The cost of this step differs between data centers and Apple computers.

On data center GPUs, verifying a batch of candidate tokens is more like chartering a bus—the fare is fixed regardless of the number of passengers. Since decoding is already memory-bound, verifying a few more tokens hardly adds any time.

Apple Silicon is more like a metered taxi—the more candidate tokens verified, the higher the meter runs.

Rahim measured this directly. For Gemma-4 12B, each additional token verified costs about 14 milliseconds. He built this figure into a cost model and concluded that the speedup ceiling on Apple Silicon is around 2.2x.
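The bus-versus-taxi contrast can be made concrete with a toy calculation. The sketch below is not Rahim's cost model, which the article does not spell out, and it makes no attempt to reproduce the 2.2x figure. It assumes that one verification pass over k candidates costs the baseline per-token time plus a fixed amount per candidate, and it ignores the draft model's own cost. Only the 18.4 tok/s baseline and the 14 ms figure come from the article; the function name and the example of 3 accepted candidates out of 4 are illustrative.

```python
BASELINE_TOK_S = 18.4        # Gemma-4 12B baseline on the M4 Pro (from the article)
PER_TOKEN_VERIFY_MS = 14.0   # extra cost per verified candidate (from the article)


def cycle_speedup(accepted: int, k: int, per_token_ms: float,
                  base_ms: float = 1000 / BASELINE_TOK_S) -> float:
    """Speedup of one draft-verify cycle over plain decoding.

    Assumption: verifying k candidates costs base_ms + per_token_ms * k,
    and the draft model's cost is ignored.
    """
    tokens = accepted + 1                 # accepted candidates plus the target's own token
    cycle_ms = base_ms + per_token_ms * k
    return tokens * base_ms / cycle_ms


bus = cycle_speedup(accepted=3, k=4, per_token_ms=0.0)                   # flat fare
taxi = cycle_speedup(accepted=3, k=4, per_token_ms=PER_TOKEN_VERIFY_MS)  # metered
```

Under these assumptions the flat-fare case yields 4x for the cycle, while the metered case yields a little under 2x for the same number of accepted tokens, which is why the per-candidate cost caps the achievable speedup.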

To do this, Rahim ported the draft model from its Hugging Face checkpoint and paired it with the target models Gemma-4 12B and Qwen3-4B.

He also rebuilt the verification process within the MLX framework and quantized the weights to 4-bit.

As a result, on the M4 Pro, compared to Apple's official MLX tool, Gemma-4 12B's generation speed increased from 18.4 tok/s to about 30 tok/s, about 1.6x the original; Qwen3-4B increased from 52.9 tok/s to about 73 tok/s, about 1.4x the original.

Additionally, in mlx-dspark, Rahim did something most porting work doesn't.

A Port That Reproduces the Original Faithfully

Most local ports of speculative decoding support only greedy decoding, meaning they pick the highest-probability token at each step.

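In mlx-dspark, Rahim implemented the temperature sampling method originally described in the DSpark paper. The draft model provides candidate tokens, and each candidate is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to it. A rejected position is resampled from the residual distribution, which is proportional to max(0, p - q).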
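The accept-or-resample rule can be written out for a single position. This is a minimal sketch of the standard speculative sampling step, assuming `p` and `q` are the target and draft distributions over a small vocabulary after temperature has been applied; the function names are illustrative and are not taken from mlx-dspark.

```python
import random
from typing import List, Sequence


def residual(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(p - q, 0): the distribution used after a rejection."""
    r = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(r)
    # z == 0 only when p == q, in which case no candidate is ever rejected.
    return [x / z for x in r] if z > 0 else list(p)


def verify_token(p: Sequence[float], q: Sequence[float],
                 drafted: int, rng: random.Random) -> int:
    """Accept `drafted` with probability min(1, p/q), else resample from the residual."""
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def sample_one(p: Sequence[float], q: Sequence[float], rng: random.Random) -> int:
    """One full step: the draft proposes from q, then the target verifies."""
    drafted = rng.choices(range(len(q)), weights=q)[0]
    return verify_token(p, q, drafted, rng)
```

Drawing many tokens with `sample_one` and tallying them gives frequencies that approach `p`, even though every proposal came from `q`.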

He personally verified that the output from this process strictly equals the exact distribution the target model would give at the same temperature, not a discounted approximation.

Most speculative decoding implementations only do the greedy version because verifying the correctness of greedy mode is simple—just compare byte-by-byte.

The extra step Rahim took was personally checking the output distribution generated in sampling mode to confirm it wasn't distorted.
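The article does not say how Rahim ran this check. One way to see why the sampling scheme is exact, sketched here as an assumption rather than his procedure, is to compute the output distribution of a single accept-or-resample step in closed form and compare it with the target distribution. The names `speculative_marginal` and `total_variation` are illustrative.

```python
from typing import List, Sequence


def speculative_marginal(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Exact output distribution of one accept-or-resample step."""
    # Probability that token x is drafted AND accepted:
    # q[x] * min(1, p[x] / q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # Unnormalized residual max(p - q, 0), written as p - min(p, q).
    leftover = [pi - ai for pi, ai in zip(p, accepted)]
    z = sum(leftover)
    if z <= 0.0:
        return accepted                  # p == q: nothing is ever rejected
    return [a + reject_mass * l / z for a, l in zip(accepted, leftover)]


def total_variation(a: Sequence[float], b: Sequence[float]) -> float:
    """Half the L1 distance between two distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))
```

For any valid `p` and `q`, `total_variation(speculative_marginal(p, q), p)` is zero up to floating-point error, because the rejected probability mass is exactly the mass the residual needs to fill in.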

Which target model to verify against, and at what precision, was a pitfall he worked out by trial and error.

If the small model was paired with a base target model without instruction fine-tuning, only 47% of the candidate tokens passed verification; switching to the corresponding instruction-tuned version increased this ratio to 82%.

He also tested switching the target model to bf16 precision. The increase in verification cost outweighed the increase in acceptance rate, making it slower, so leaving the target model at 8-bit by default is most cost-effective.

The small model responsible for generating candidate tokens uses a different precision.

Rahim also compressed the draft model itself. After 4-bit quantization it is only 1.8 GB, fits easily in memory, and runs without loss.
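For readers unfamiliar with 4-bit quantization, the following pure-Python sketch shows the general idea, assuming a group-wise affine scheme in which each group of weights stores one scale and one offset. The group size of 64 and all function names are illustrative assumptions; the article does not describe the exact layout mlx-dspark uses.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Affine-quantize one group of weights to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0     # avoid a zero scale for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, bias: float) -> List[float]:
    """Map integer codes back to approximate weights."""
    return [c * scale + bias for c in codes]


def quantize(weights: List[float], group_size: int = 64, bits: int = 4):
    """Quantize a flat weight list group by group."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]
```

Each weight is stored as one of 16 codes, so the reconstruction error per weight is at most half of that group's scale.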

The result is that DSpark not only delivered a speedup but also reproduced on-device the 16% to 18% acceptance-rate improvement reported in the paper.

DFlash Also Integrated, Faster on Code Tasks

After the tweet was posted, a comment appeared in the replies. Jian Chen, one of the authors of the DFlash paper, asked if they could try his team's model.

DFlash is another speculative decoding scheme, proposed in a paper z-lab published in May. The team is led by Zhijian Liu, an assistant professor at UCSD who is also a research scientist at NVIDIA.

DFlash takes a different approach from DSpark. It uses a single parallel "block diffusion" pass to denoise an entire block of 16 tokens, rather than guessing step by step with dependencies between positions, as DSpark does.

Rahim got to work quickly.

Using a porting script written by Jian himself, he connected the z-lab-released gemma4-12B-it-DFlash to the Gemma-4 target model in mlx-vlm. On the same Mac, he then ran a head-to-head comparison against the DSpark setup he had just tested.

On code and math tasks, DFlash's block decoding reached an acceptance length of 5.95 to 6.20 and a speed of about 36 tok/s, roughly a 2.1x speedup, beating DSpark.

However, DFlash generates an entire block of 16 tokens at once, and the target model may not accept all of them. Only a leading portion actually passes verification. Its length is known in the field as the "acceptance length", and it does not always reach the full 16.

Therefore, in scenarios such as open-ended chat, where the content is unpredictable, the acceptance length stays short, the block is not fully used, and DFlash's advantage does not show.
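Acceptance length is straightforward to measure. The sketch below assumes greedy verification, where a drafted token passes only if it matches the token the target would have produced at that position. The function names are illustrative, and conventions vary: some reports also count the extra token the target contributes each cycle, and the article does not say which convention its figures use.

```python
from typing import List, Sequence


def acceptance_length(drafted: Sequence[int], verified: Sequence[int]) -> int:
    """Number of leading drafted tokens that match the target's tokens."""
    n = 0
    for d, t in zip(drafted, verified):
        if d != t:
            break                        # everything after the first mismatch is discarded
        n += 1
    return n


def mean_acceptance_length(blocks: List[Sequence[int]],
                           targets: List[Sequence[int]]) -> float:
    """Average acceptance length over a set of drafted blocks."""
    lengths = [acceptance_length(b, t) for b, t in zip(blocks, targets)]
    return sum(lengths) / len(lengths)


def block_utilization(mean_length: float, block_size: int = 16) -> float:
    """Fraction of a drafted block that survives verification on average."""
    return mean_length / block_size
```

With the article's figures of 5.95 to 6.20 accepted tokens per 16-token block, `block_utilization` comes out at roughly 37% to 39%, so even on code and math most of each block is discarded.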

DSpark's Markov head exists precisely to address this same issue. Parallel generation of an entire block of tokens means positions further back are calculated independently, making them prone to misalignment. The Markov head adds a layer of dependency between these positions specifically to correct this.

As a result, in chat scenarios DSpark is actually faster than DFlash.

The later mlx-dspark v0.0.3 release integrated z-lab's original DFlash into the package and added a parameter for manually shortening DFlash's effective block length: short blocks for chat, and the full 16-token block for code and math.

With this, the same package on the same Mac can handle both chat and code/math tasks, with no need to switch between the DSpark and DFlash projects.
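Why a shorter block can help in chat follows from a simple model. The sketch below is an illustration under stated assumptions, not the package's logic: each candidate is accepted independently with probability `alpha`, verification costs a fixed amount per candidate, and drafting cost is ignored. The 18.4 tok/s baseline and the 14 ms cost are the article's Gemma-4 12B figures; the `alpha` values, the function names, and `best_block_length` itself are hypothetical, and the article does not name the actual parameter.

```python
BASE_MS = 1000 / 18.4      # per-token time implied by the article's 18.4 tok/s baseline
VERIFY_MS = 14.0           # per-candidate verification cost quoted in the article


def expected_tokens(alpha: float, block: int) -> float:
    """Expected tokens emitted per cycle when each candidate is accepted with
    probability alpha and the cycle stops at the first rejection (the target's
    own token is included)."""
    if alpha >= 1.0:
        return block + 1.0
    return (1.0 - alpha ** (block + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, block: int,
                    base_ms: float = BASE_MS, verify_ms: float = VERIFY_MS) -> float:
    """Speedup over plain decoding under the linear verification-cost assumption."""
    cycle_ms = base_ms + verify_ms * block       # drafting cost ignored
    return expected_tokens(alpha, block) * base_ms / cycle_ms


def best_block_length(alpha: float, max_block: int = 16) -> int:
    """Block length in 1..max_block with the highest modeled speedup."""
    return max(range(1, max_block + 1), key=lambda b: modeled_speedup(alpha, b))
```

In this model a low acceptance rate, as in unpredictable chat, favors a short block, while a high acceptance rate, as in code or math, favors a longer one, because every unaccepted candidate still incurs its verification cost.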

Rahim said in his tweet that the same method should also work on larger Qwen3-8B and 14B draft models.

Reference Links:
[1] https://x.com/_ARahim_/status/2072021710602432577
[2] https://github.com/ARahim3/mlx-dspark

This article is from the official WeChat account "QbitAI", author: Focus on Frontier Technology

Related Questions

Q: What is the main achievement of the mlx-dspark project described in the article?

A: mlx-dspark is the first native Apple Silicon port of DeepSeek's DSpark technology. It significantly speeds up the inference of models like Gemma-4 12B and Qwen3-4B on Macs (to ~1.6x and ~1.4x of the baseline, respectively) while maintaining output quality identical to the original models, byte-for-byte.

Q: How does DSpark's speculative decoding method work to accelerate inference?

A: DSpark uses a smaller "draft" model to rapidly generate multiple candidate tokens (speculative decoding). The larger target model then verifies these candidates in a batch. Correct tokens are accepted, and incorrect ones are rejected and regenerated. This reduces the number of times the slower target model needs to run.

Q: What key difference did Abdur Rahim implement in mlx-dspark that many other ports do not?

A: Unlike many ports that only support greedy decoding, Rahim implemented the full temperature sampling method as described in the DSpark paper. This ensures the output distribution is mathematically identical to the target model's, not just an approximation, preserving generation quality.

Q: How does the performance of DFlash compare to DSpark on Mac, according to the tests in the article?

A: DFlash, which decodes in parallel blocks of 16 tokens, outperforms DSpark on code and math tasks, achieving speeds around 36 tok/s (~2.1x speedup). However, in open-ended chat scenarios where content is less predictable, DSpark's Markov head (which adds dependencies between candidate tokens) performs better than DFlash's block decoding.

Q: What practical feature was added in mlx-dspark v0.0.3 to handle different types of tasks?

A: Version 0.0.3 integrated the original DFlash model and added a parameter to manually adjust the effective block length. Users can use shorter blocks for chat scenarios and full 16-token blocks for code/math tasks, allowing a single package to handle different task types efficiently.
