Just now, DeepSeek V4 updates with DSpark, improving inference speed by 80%

marsbit | Published on 2026-06-27 | Last updated on 2026-06-27

Abstract

DeepSeek has updated its DeepSeek V4 model with the DSpark speculative decoding framework, achieving a significant 60-85% speedup in generation for Flash models and 57-78% for Pro models while maintaining the same overall throughput. This engineering-focused update, rather than a core architectural change, introduces DSpark to address latency and throughput bottlenecks in high-concurrency production environments. DSpark combines high-throughput parallel generation with adaptive load-aware verification. Its key innovations include a semi-autoregressive generation architecture to model dependencies within token blocks and a hardware-aware confidence-scheduled verification system. This system uses a confidence head to predict token acceptance probabilities, allowing it to dynamically optimize verification length per request and allocate compute only to tokens with the highest expected payoff. The asynchronous scheduler is designed for real-world deployment, ensuring zero-overhead scheduling and continuous CUDA graph replay while preserving the target model's output distribution. In tests across mathematical reasoning, code generation, and daily dialogue, DSpark outperformed state-of-the-art models like Eagle3 and DFlash, increasing average acceptance length by 26.7%-30.9% and 16.3%-18.4% respectively on Qwen3 target models. DeepSeek also open-sourced DeepSpec, a full-stack codebase for training and evaluating speculative decoding draft models, providing a standardized toolkit...

DeepSeek V4 has just received an update.

DeepSeek launched DSpark, a new speculative decoding framework, and at the same time open-sourced DeepSpec, the full-stack speculative decoding codebase behind this release.

DeepSeek-V4-Pro-DSpark is not a model with a new architecture; it adds a speculative decoding module on top of DeepSeek-V4-Pro. The update focuses on engineering and deployment rather than on the model's core capabilities.

DSpark is already deployed on live production traffic for DeepSeek-V4 (Flash and Pro), where it significantly speeds up large language model (LLM) inference.

Technical Report: DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation

Technical Report Link: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

The core purpose of DSpark is to relieve the latency and throughput bottlenecks that LLM inference faces in production, especially under high concurrency. In short, DSpark combines high-throughput "parallel generation" with adaptive "load-aware verification."

Speculative decoding is a technique for accelerating large language model inference without altering the model's output distribution. Its core idea is to have a lightweight "draft model" propose several candidate tokens in advance; the target model then verifies them in one batched pass and keeps the accepted prefix. This turns serial, token-by-token generation into parallel, batched verification, greatly reducing end-to-end latency.
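
To make the verification step concrete, here is a minimal sketch of the standard speculative sampling acceptance rule, which is what keeps the output distribution unchanged. It is a toy illustration, not DSpark's implementation: the function name verify_draft and the list-based probability tables are assumptions made for this example.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Verify drafted tokens against the target model's distributions.

    draft_tokens:    token ids proposed by the draft model.
    draft_probs[i]:  draft distribution (list over the vocabulary) from which
                     draft_tokens[i] was sampled.
    target_probs[i]: target distribution at the same position; it has one
                     extra entry for the bonus token emitted when every
                     drafted token is accepted.
    Returns the tokens emitted in this decoding step.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft was accepted: sample one bonus token from the target.
    bonus = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

Each step therefore emits between one token and one more than the number of drafted tokens, which is why a longer average acceptance length translates into faster generation.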

Building on this, DSpark introduces a semi-autoregressive generation architecture: it keeps the high-throughput advantage of parallel draft models while adding a lightweight serial module that models the dependencies between tokens within a block. This mitigates the decay in acceptance rate that parallel draft models tend to suffer at later positions.

DSpark also introduces hardware-aware confidence-scheduled verification. Earlier speculative decoding systems often send every drafted token for verification. Under high system load, tail tokens that are very likely to be rejected waste scarce batch compute. DSpark adds a confidence head that estimates the survival probability of each token. Combined with a hardware-aware prefix scheduler, this lets the system choose a verification length for each request based on the engine's real-time throughput characteristics, spending compute only on the tokens with the highest expected payoff.
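
The idea of spending verification compute only where the expected payoff is highest can be illustrated with a small sketch. It is not DSpark's scheduler, and several things in it are assumptions: the function name schedule_verification, a fixed per-step token budget standing in for real-time engine throughput, and confidence values treated as conditional acceptance probabilities.

```python
import heapq
from itertools import accumulate
from operator import mul


def schedule_verification(confidences, budget):
    """Pick a verification prefix length per request under a shared budget.

    confidences[r][i]: hypothetical confidence-head estimate that draft token
        i of request r is accepted, given all earlier tokens were accepted.
    budget: total number of draft tokens the engine verifies this step
        (an assumed stand-in for hardware-aware throughput limits).
    Returns one prefix length per request.
    """
    # Survival probability of token i = product of confidences up to i.
    survival = [list(accumulate(c, mul)) for c in confidences]
    lengths = [0] * len(confidences)
    # Max-heap over the next unscheduled token of each request.
    heap = [(-s[0], r) for r, s in enumerate(survival) if s]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        _, r = heapq.heappop(heap)
        lengths[r] += 1
        if lengths[r] < len(survival[r]):
            heapq.heappush(heap, (-survival[r][lengths[r]], r))
    return lengths
```

Because survival probabilities never increase along a request, greedily taking the highest-survival tokens always yields valid prefixes. With confidences [[0.9, 0.9, 0.9], [0.5, 0.5, 0.5]] and a budget of 4, the confident request gets its full three tokens verified and the uncertain one only its first.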

To run on real online infrastructure, DSpark's scheduler works asynchronously so that it stays compatible with zero-overhead scheduling (ZOS) and continuous CUDA graph replay. It uses historical predictions from the previous two steps to decide the current dynamic truncation length, which hides scheduling latency and avoids GPU pipeline stalls while still preserving the target model's output distribution exactly.

In tests covering multiple domains such as mathematical reasoning, code generation, and daily conversation, DSpark significantly outperformed the current state-of-the-art autoregressive draft model (Eagle3) and parallel draft model (DFlash). For example, on Qwen3 series (4B, 8B, 14B) target models, its average acceptance length improved by 26.7% to 30.9% compared to Eagle3, and by 16.3% to 18.4% compared to DFlash.

Compared with MTP-1, the previous-generation single-token baseline used in production, DSpark increased per-user generation speed by 60%-85% on the Flash model and 57%-78% on the Pro model at the same overall throughput.

Released alongside DSpark is DeepSpec, a full-stack codebase for training and evaluating speculative decoding draft models. It is the "open-source infrastructure" that hosts this solution and other advanced algorithm implementations, containing data preparation tools, draft model implementations, training code, and evaluation scripts.

DeepSpec splits the overall workflow into three stages: data preparation, training, and evaluation. The three stages need to be run sequentially, with the output of the previous stage serving as input for the next.

In the data preparation stage, users download prompt data, regenerate answers with the target model using an inference engine, and build the target cache. Note that with the default Qwen/Qwen3-4B configuration, the target cache can reach about 38 TB, so storage resources should be assessed carefully before starting.
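
A quick pre-flight check can catch a storage shortfall before the cache build starts. The sketch below uses only the Python standard library; the function name has_room_for_cache, the 10% headroom factor, and the example path are assumptions for illustration and are not part of DeepSpec.

```python
import shutil

TB = 10**12  # decimal terabyte, in bytes


def has_room_for_cache(path, required_tb=38.0, headroom=1.1):
    """Check free space on the filesystem that contains `path`.

    required_tb defaults to the roughly 38 TB quoted for the default
    Qwen/Qwen3-4B configuration; headroom is an assumed safety margin.
    Returns (ok, free_tb).
    """
    free_tb = shutil.disk_usage(path).free / TB
    return free_tb >= required_tb * headroom, free_tb


if __name__ == "__main__":
    # "." is a placeholder; point this at the intended cache directory.
    ok, free_tb = has_room_for_cache(".")
    print(f"free: {free_tb:.2f} TB, enough for target cache: {ok}")
```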

The training stage is launched with bash scripts/train/train.sh. This script calls train.py and launches one worker per visible GPU. Users choose among the algorithm and target model configurations in the config/ directory by specifying config_path. Training settings can be adjusted by overriding config_path and target_cache_dir, or by using --opts to modify individual configuration fields.

Regarding hardware, DeepSpec's default configuration and scripts are designed for a single-node 8-GPU environment. If there are fewer GPUs, users need to correspondingly reduce the number of visible GPUs in CUDA_VISIBLE_DEVICES.
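
As a rough sketch of running with fewer GPUs, the helper below builds an environment that exposes only the first few devices and then invokes the training script named above. The helper names train_env and launch_training are hypothetical, and the sketch assumes the script picks up CUDA_VISIBLE_DEVICES from the environment as described.

```python
import os
import subprocess


def train_env(num_gpus, base_env=None):
    """Return a copy of the environment exposing only GPUs 0..num_gpus-1."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_gpus))
    return env


def launch_training(num_gpus):
    """Run the training script from the repository root (assumed cwd)."""
    return subprocess.run(
        ["bash", "scripts/train/train.sh"],
        env=train_env(num_gpus),
        check=True,
    )
```

For example, launch_training(4) would start the script with CUDA_VISIBLE_DEVICES set to 0,1,2,3.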

The evaluation stage is launched via bash scripts/eval/eval.sh. The evaluation script will use the trained draft model checkpoint to measure acceptance on multiple speculative decoding benchmark tasks. The project's currently listed evaluation datasets include GSM8K, MATH500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, Alpaca, and Arena-Hard-v2, covering different task types such as mathematical reasoning, code generation, dialogue ability, and comprehensive Q&A.

In terms of algorithms, DeepSpec currently has three built-in draft models: DSpark, DFlash, and Eagle3. For target model series, the project currently supports Qwen3 and Gemma.

The open-sourcing of DeepSpec integrates the engineering practices of speculative decoding, which were previously scattered across various research teams, into a reproducible and extensible standardized toolchain. For researchers and engineers hoping to accelerate inference for their own large models, this means they can directly train custom draft models on a mature framework, skipping a large amount of repeated infrastructure building work.

Reference Links:

https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

https://github.com/deepseek-ai/DeepSpec

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), authors: Ze Nan, Yang Wen
