DeepSeek's New Technology Ported to Apple Silicon, Mac Local LLM Accelerated by 60%

marsbit | Published 2026-07-03 | Last updated 2026-07-03

Summary

DeepSeek's newly open-sourced DSpark inference acceleration technology has been ported to Apple Silicon, yielding significant speedups for running large language models locally on Macs. The port, called mlx-dspark, was developed by engineer Abdur Rahim and supports models like Gemma-4 12B and Qwen3-4B. DSpark uses speculative decoding, where a smaller "draft" model proposes candidate tokens which are then verified in a batch by the target model. Rahim adapted this approach for Apple's MLX framework, implementing 4-bit quantization for the draft model. On an M4 Pro Mac, this resulted in generation speeds increasing by approximately 1.6x for Gemma-4 12B (to ~30 tok/s) and 1.4x for Qwen3-4B (to ~73 tok/s). Crucially, the port maintains bitwise identical output to the original models, including support for temperature sampling, not just greedy decoding. The project also integrated DFlash, an alternative block-based speculative decoding method from z-lab. Benchmarks show DFlash excels in predictable contexts like code/math tasks (achieving ~2.1x speedup), while DSpark's Markov head provides better performance for open-ended chat. The latest mlx-dspark version allows users to switch between these methods. The work demonstrates efficient, high-fidelity local LLM inference on consumer Apple hardware.

By Kressey, from Aofeisi | QbitAI (WeChat official account: QbitAI)

Just one week after DeepSeek open-sourced DSpark, it's been ported to Apple computers.

The port is called mlx-dspark, and it currently runs the Gemma-4 12B and Qwen3-4B models.

With it installed, generation speed for these two models on a Mac rises to about 1.6x and 1.4x of the baseline, respectively.

More importantly, it achieves something most ports do not: the output is byte-for-byte identical to the original model's, with not a single character off.

In other words, speed is gained without sacrificing any quality.

The person behind this is Abdur Rahim, an engineer who tinkers with open-source projects in his spare time. He single-handedly created the first native Mac version since DSpark was open-sourced.

Local LLMs on Mac, 60% Faster

For DeepSeek's DSpark, open-sourced on June 27th, the official figures show a speed improvement of 60% to 85% in server-side scenarios.

However, this technology initially only had implementations for data center GPUs, with no version adapted for Apple Silicon.

mlx-dspark is the first native Apple Silicon version of this technology.

The idea behind DSpark is to pair a smaller model to assist the target model. The small model first generates several candidate tokens in one go, then the target model verifies them all at once, accepting the correct ones and rejecting the wrong ones for re-guessing.
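The draft-then-verify loop described above can be sketched in a few lines. This is a minimal greedy illustration only, not mlx-dspark's code: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and the verification loop checks positions one by one where a real system would score all of them in a single batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def plain_greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k candidate tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the candidates and stops at the first mismatch.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3. The target supplies the next token (a correction or a bonus token).
        out.append(target(out))
    return out[:len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical deterministic 'model' over a 50-token vocabulary."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft: agrees with the target except at every third position."""
    tok = toy_target(ctx)
    return tok if len(ctx) % 3 else (tok + 1) % 50


if __name__ == "__main__":
    fast = speculative_greedy(toy_target, toy_draft, [1, 2, 3], max_new=20)
    slow = plain_greedy(toy_target, [1, 2, 3], max_new=20)
    print(fast == slow)
```

Because every kept token is one the target would have chosen anyway, the sequence matches plain greedy decoding regardless of how good the draft is. A weaker draft only lowers the number of tokens gained per verification pass.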

The cost of this step differs between data centers and Apple computers.

On data center GPUs, verifying a batch of candidate tokens is more like chartering a bus—the fare is fixed regardless of the number of passengers. Since decoding is already memory-bound, verifying a few more tokens hardly adds any time.

Apple Silicon is more like a metered taxi—the more candidate tokens verified, the higher the meter runs.

Rahim measured it. For Gemma-4 12B, each additional token verified costs about 14 milliseconds. He built this into a cost model and concluded that the speed ceiling on Apple Silicon is around 2.2x.

For the port itself, Rahim converted the small draft model from its Hugging Face checkpoint and paired it with the target models Gemma-4 12B and Qwen3-4B.
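The bus-versus-taxi contrast can be made concrete with a toy cost model. The article does not spell out Rahim's model, so the sketch below is an assumption-laden illustration: it assumes each drafted token is accepted independently with a fixed probability, ignores the draft model's own running time, and takes the base step time from the 18.4 tok/s baseline reported for Gemma-4 12B. The function names are hypothetical.

```python
def expected_tokens(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per verify pass, assuming each drafted token
    is accepted independently with probability accept_rate."""
    # Accepted prefix of up to k drafts, plus one token the target always supplies.
    return sum(accept_rate ** i for i in range(k + 1))


def expected_speedup(base_ms: float, per_token_ms: float,
                     accept_rate: float, k: int) -> float:
    """Speedup over plain decoding; the draft model's cost is ignored (assumption)."""
    verify_ms = base_ms + per_token_ms * k                 # one verify pass over k drafts
    plain_ms = base_ms * expected_tokens(accept_rate, k)   # same tokens, one pass each
    return plain_ms / verify_ms


BASE_MS = 1000 / 18.4   # one plain decode step at 18.4 tok/s
ACCEPT = 0.82           # acceptance rate the article quotes for the instruction-tuned target

if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        flat = expected_speedup(BASE_MS, 0.0, ACCEPT, k)      # "chartered bus"
        metered = expected_speedup(BASE_MS, 14.0, ACCEPT, k)  # "metered taxi"
        print(f"k={k:2d}  flat={flat:.2f}x  metered={metered:.2f}x")
```

Under these assumptions the flat-fare curve keeps rising with the number of drafted tokens, while the metered curve peaks at a small block size and then falls. The absolute numbers depend on the simplifications and are not expected to match the 2.2x ceiling Rahim derived.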

He also rebuilt the verification process within the MLX framework and quantized the weights to 4-bit.

As a result, on the M4 Pro, compared to Apple's official MLX tool, Gemma-4 12B's generation speed increased from 18.4 tok/s to about 30 tok/s, about 1.6x the original; Qwen3-4B increased from 52.9 tok/s to about 73 tok/s, about 1.4x the original.

Rahim also did something in mlx-dspark that most ports skip.

A Port That Reproduces the Original Faithfully

Most local ports of speculative decoding support only greedy decoding, which picks the highest-probability token at each step.

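In mlx-dspark, Rahim implemented the temperature sampling method originally described in the DSpark paper. The draft model proposes a candidate token, which is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to that token. A rejected token is replaced by one resampled from the residual distribution, which is proportional to max(0, p − q).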
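The accept-or-resample rule can be written out for a single token position. The following is a minimal sketch of the standard speculative sampling step using plain Python lists as probability distributions; it is not taken from mlx-dspark, and the function names and example distributions are hypothetical.

```python
import random
from typing import List, Sequence


def sample(weights: Sequence[float], rng: random.Random) -> int:
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def accept_or_resample(p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> int:
    """One speculative sampling step.

    p: target distribution over the vocabulary at this position.
    q: draft distribution over the same vocabulary.
    """
    x = sample(q, rng)                           # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):     # accept with probability min(1, p/q)
        return x
    # Rejected: resample from the residual, proportional to max(0, p - q).
    residual: List[float] = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return sample(residual, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]   # hypothetical target distribution
    q = [0.2, 0.5, 0.3]   # hypothetical draft distribution
    n = 20000
    counts = [0, 0, 0]
    for _ in range(n):
        counts[accept_or_resample(p, q, rng)] += 1
    print([round(c / n, 3) for c in counts])   # empirical frequencies, close to p
```

The accepted mass at each token is min(p, q) and the resampled mass is max(0, p − q), which sum to p. That is why the returned token follows the target distribution exactly, however poor the draft distribution is.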

He personally verified that the output from this process strictly equals the exact distribution the target model would give at the same temperature, not a discounted approximation.

Most speculative decoding implementations only do the greedy version because verifying the correctness of greedy mode is simple—just compare byte-by-byte.

The extra step Rahim took was personally checking the output distribution generated in sampling mode to confirm it wasn't distorted.

Choosing the right target model for verification, and the right precision for it, was a pitfall he worked out through trial and error.

If the small model was paired with a base target model without instruction fine-tuning, only 47% of the candidate tokens passed verification; switching to the corresponding instruction-tuned version increased this ratio to 82%.

He also tested switching the target model to bf16 precision. The increase in verification cost outweighed the increase in acceptance rate, making it slower, so leaving the target model at 8-bit by default is most cost-effective.

The small draft model that generates the candidate tokens uses a different precision. Rahim quantized it to 4-bit, which shrinks it to just 1.8GB, so it fits easily in memory and runs without loss.

As a result, DSpark not only delivered the speedup but also reproduced on-device the 16% to 18% acceptance-rate improvement reported in the paper.

DFlash Also Integrated, Faster on Code Tasks

After the tweet was posted, a comment appeared in the replies. Jian Chen, one of the authors of the DFlash paper, asked if they could try his team's model.

DFlash is another speculative decoding scheme, proposed in a paper published by z-lab in May. The team is led by Zhijian Liu, an assistant professor at UCSD who is also a research scientist at NVIDIA.

DFlash's approach differs from DSpark's. It uses a single parallel "block diffusion" pass to denoise an entire block of 16 tokens, rather than guessing step by step with dependencies as DSpark does.

Rahim got to work quickly.

Using a porting script written by Jian himself, he connected gemma4-12B-it-DFlash, released by z-lab, to the Gemma-4 target model in mlx-vlm. On the same Mac, he ran a head-to-head comparison against the DSpark setup he had just tested.

On code and math tasks, DFlash's block decoding reached an acceptance length of 5.95 to 6.20 and a speed of about 36 tok/s, roughly a 2.1x speedup, beating DSpark.

However, although DFlash generates an entire block of 16 tokens at once, the target model may not accept all of them. The portion that actually passes verification is known in the field as the "acceptance length", and it does not always fill all 16 positions.
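As a small illustration of the metric, acceptance length can be computed by comparing a drafted block against the tokens the target model would have chosen at the same positions. This sketch counts only the matching prefix; some papers also count the extra token the target contributes, so treat the definition here as an assumption. The helper names and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def acceptance_length(drafted: Sequence[int], target_choice: Sequence[int]) -> int:
    """Number of leading drafted tokens that match the target's own choices."""
    n = 0
    for d, t in zip(drafted, target_choice):
        if d != t:
            break          # everything after the first mismatch is discarded
        n += 1
    return n


def mean_acceptance_length(steps: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average acceptance length over several (drafted, target_choice) blocks."""
    lengths = [acceptance_length(d, t) for d, t in steps]
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    steps = [
        ([5, 9, 2, 7], [5, 9, 2, 7]),   # whole block accepted
        ([5, 9, 2, 7], [5, 9, 4, 7]),   # mismatch at the third token
        ([5, 9, 2, 7], [1, 9, 2, 7]),   # mismatch at the first token
    ]
    print(mean_acceptance_length(steps))
```

Note that a later match does not rescue a block: once one position is rejected, the tokens after it were drafted on a context the target never produced.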

Therefore, in scenarios like open chat where content is unpredictable, the acceptance length doesn't increase, the block isn't fully utilized, and DFlash's advantage doesn't show.

DSpark's Markov head exists precisely to address this same issue. Parallel generation of an entire block of tokens means positions further back are calculated independently, making them prone to misalignment. The Markov head adds a layer of dependency between these positions specifically to correct this.

The result is, in chat scenarios, DSpark is actually faster than DFlash.

The later mlx-dspark v0.0.3 release integrated z-lab's original DFlash into the package and added a parameter for manually shortening DFlash's effective block length: short blocks for chat scenarios, the full 16-token block for code and math.

With this update, the same package on the same Mac can handle both chat and code/math tasks, with no need to switch between the DSpark and DFlash projects.

Rahim said in his tweet that the same method should also work on larger Qwen3-8B and 14B draft models.

Reference Links:

[1] https://x.com/_ARahim_/status/2072021710602432577

[2] https://github.com/ARahim3/mlx-dspark

This article is from the official WeChat account "QbitAI", author: Focus on Frontier Technology


Related Questions

Q: What is the main achievement of the mlx-dspark project described in the article?

A: mlx-dspark is the first native Apple Silicon port of DeepSeek's DSpark technology. It significantly speeds up the inference of models like Gemma-4 12B and Qwen3-4B on Macs (by ~1.6x and ~1.4x respectively) while maintaining output quality identical to the original models, byte-for-byte.

Q: How does DSpark's speculative decoding method work to accelerate inference?

A: DSpark uses a smaller "draft" model to rapidly generate multiple candidate tokens (speculative decoding). The larger target model then verifies these candidates in a batch. Correct tokens are accepted, and incorrect ones are rejected and regenerated. This reduces the number of times the slower target model needs to run.

Q: What key difference did Abdur Rahim implement in mlx-dspark that many other ports do not?

A: Unlike many ports that only support greedy decoding, Rahim implemented the full temperature sampling method as described in the DSpark paper. This ensures the output distribution is mathematically identical to the target model's, not just an approximation, preserving generation quality.

Q: How does the performance of DFlash compare to DSpark on Mac, according to the tests in the article?

A: DFlash, which decodes in parallel blocks of 16 tokens, outperforms DSpark on code and math tasks, achieving speeds around 36 tok/s (~2.1x speedup). However, in open-ended chat scenarios where content is less predictable, DSpark's Markov head (which adds dependencies between candidate tokens) performs better than DFlash's block decoding.

Q: What practical feature was added in mlx-dspark v0.0.3 to handle different types of tasks?

A: Version 0.0.3 integrated the original DFlash model and added a parameter to manually adjust the effective block length. Users can use shorter blocks for chat scenarios and full 16-token blocks for code/math tasks, allowing a single package to handle different task types efficiently.
