Just now, DeepSeek V4 updates with DSpark, improving inference speed by 80%

marsbit, published 2026-06-27, updated 2026-06-27

Summary

DeepSeek has updated its DeepSeek V4 model with the DSpark speculative decoding framework. Compared with the previous MTP-1 production baseline, and at the same overall throughput, DSpark speeds up per-user generation by 60%-85% for the Flash model and 57%-78% for the Pro model. This is an engineering-focused update rather than a core architectural change: DSpark targets the latency and throughput bottlenecks of high-concurrency production environments by combining high-throughput parallel generation with adaptive, load-aware verification. Its key innovations are a semi-autoregressive generation architecture that models dependencies within token blocks and a hardware-aware, confidence-scheduled verification system. That system uses a confidence head to predict token acceptance probabilities, which lets it optimize the verification length per request and allocate compute only to the tokens with the highest expected payoff. The asynchronous scheduler is designed for real-world deployment: it is compatible with zero-overhead scheduling and continuous CUDA graph replay while preserving the target model's output distribution. In tests across mathematical reasoning, code generation, and daily dialogue, DSpark outperformed the state-of-the-art draft models Eagle3 and DFlash, increasing average acceptance length on Qwen3 target models by 26.7%-30.9% over Eagle3 and 16.3%-18.4% over DFlash. DeepSeek also open-sourced DeepSpec, a full-stack codebase for training and evaluating speculative decoding draft models, providing a standardized toolkit...

Just now, DeepSeek V4 received an update.

The update introduces the speculative decoding framework DSpark, and DeepSeek has simultaneously open-sourced DeepSpec, the full-stack speculative decoding codebase that supports this release.

DeepSeek-V4-Pro-DSpark is not a model with a new architecture; it adds a speculative decoding module on top of DeepSeek-V4-Pro. The update focuses on engineering deployment rather than on iterating the model's core capabilities.

DSpark is already serving live online traffic for DeepSeek-V4 (Flash and Pro), where it significantly accelerates large language model (LLM) inference.

Technical Report: DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation

Technical Report Link: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

The core purpose of DSpark is to solve the latency and throughput bottlenecks faced by LLM inference in production environments (especially in high-concurrency scenarios). In short, DSpark successfully combines high-throughput "parallel generation" with adaptive "load-aware verification."

Speculative decoding is a technique to accelerate large language model inference without altering the model's output distribution. Its core idea is to introduce a lightweight "draft model" to pre-generate several candidate tokens, which are then batch-verified and accepted by the target model. This transforms serial, token-by-token generation into parallel, batch verification, greatly reducing end-to-end latency.
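The draft-then-verify step can be illustrated with a minimal sketch of the greedy case. This is not DSpark or DeepSpec code: the function name verify_greedy and the toy token IDs are hypothetical, and the sketch assumes greedy decoding, where a draft token is accepted only if it equals the target model's own top choice at that position. Sampling-based decoding needs a probabilistic acceptance rule instead.

```python
from typing import List, Sequence


def verify_greedy(draft: Sequence[int], target_next: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy case).

    draft:        candidate tokens proposed by the draft model.
    target_next:  target_next[i] is the target model's greedy token given the
                  context plus draft[:i]. One forward pass over the whole draft
                  yields len(draft) + 1 such predictions.
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have exactly len(draft) + 1 entries")

    emitted: List[int] = []
    for i, token in enumerate(draft):
        if token != target_next[i]:
            break  # first mismatch: every later draft token is discarded
        emitted.append(token)

    # The target's own prediction at the stopping point is always valid:
    # a correction after a mismatch, or a bonus token if all drafts matched.
    emitted.append(target_next[len(emitted)])
    return emitted


# Two of three draft tokens match, so the step emits them plus one correction.
print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
```

Every step emits at least one token, so the output matches what the target model would have produced on its own; the speedup comes from the steps that emit several tokens for a single target forward pass.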

Building on this, DSpark's innovation lies in introducing a Semi-Autoregressive Generation architecture: it retains the high-throughput advantage of parallel draft models while incorporating a lightweight serial module to model dependency relationships between tokens within a block. This alleviates the issue of acceptance rate decay that parallel draft models tend to suffer in later positions.
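Why decay at later positions matters can be seen from a small calculation. The sketch below is illustrative only: the acceptance probabilities are made-up numbers, not measurements from the report, and the function name is hypothetical. It assumes a draft token can only be accepted if every earlier token in the block was accepted.

```python
from typing import Sequence


def expected_accepted_length(accept_probs: Sequence[float]) -> float:
    """Expected number of accepted draft tokens in one block.

    accept_probs[i] is the assumed probability that draft token i is accepted
    given that all earlier tokens in the block were accepted.
    """
    expected = 0.0
    survival = 1.0  # probability that the prefix ending at this token survives
    for p in accept_probs:
        survival *= p
        expected += survival
    return expected


# Hypothetical numbers: a flat acceptance rate versus one that decays by position.
flat = expected_accepted_length([0.8, 0.8, 0.8, 0.8])
decaying = expected_accepted_length([0.8, 0.7, 0.6, 0.5])
print(f"flat: {flat:.2f}  decaying: {decaying:.2f}")
```

With these toy values the flat block yields about 2.36 accepted tokens on average and the decaying block about 1.86, which is the kind of gap that modeling dependencies within a block is meant to narrow.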

The second innovation is Hardware-Aware Confidence-Scheduled Verification. Earlier speculative decoding systems often sent every generated draft token for verification. Under high system load, the tokens at the tail of a draft, which are very likely to be rejected, waste valuable batch compute. DSpark introduces a Confidence Head that estimates the survival probability of each token. Combined with a hardware-aware prefix scheduler, the system dynamically chooses the verification length for each request based on real-time engine throughput characteristics, allocating compute only to the tokens with the highest expected payoff.
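The idea of truncating a draft by expected payoff can be sketched as follows. This is a simplified illustration, not the scheduler described in the report: the function name, the per-token confidence values, and the single marginal_cost threshold are all assumptions. Here marginal_cost stands in for the load-dependent cost of verifying one more token, which the real system derives from engine throughput characteristics.

```python
from typing import Sequence


def choose_verify_length(confidences: Sequence[float], marginal_cost: float) -> int:
    """Pick how many draft tokens to send for verification.

    confidences[i]: assumed confidence-head estimate that draft token i is
                    accepted, given that all earlier tokens were accepted.
    marginal_cost:  hypothetical cost of verifying one more token, expressed
                    in expected-token units; higher under heavier load.
    """
    survival = 1.0
    best_len = 0
    for k, conf in enumerate(confidences, start=1):
        survival *= conf  # chance that token k is reached and accepted
        if survival < marginal_cost:
            break  # survival only shrinks, so later tokens pay off even less
        best_len = k
    return best_len


confidences = [0.9, 0.8, 0.5, 0.4]
# A lightly loaded engine (low cost) verifies more tokens than a busy one.
print(choose_verify_length(confidences, marginal_cost=0.3))
print(choose_verify_length(confidences, marginal_cost=0.5))
```

Because the survival probability is a running product of values no greater than 1, it never increases along the draft, so stopping at the first token below the threshold is enough in this simplified model.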

To run on real online infrastructure, DSpark's scheduler is asynchronous so that it stays compatible with zero-overhead scheduling (ZOS) and continuous CUDA graph replay. It uses historical predictions from the previous two steps to decide the current dynamic truncation length, which hides scheduling latency and avoids GPU pipeline stalls while still guaranteeing that the target model's output distribution is preserved exactly.
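One way to picture the delayed decision is a small queue in which the length applied at a given step was computed two steps earlier. This is only one reading of the description above, not DSpark's implementation; the class name, the default length, and the fixed delay of two steps are assumptions made for illustration.

```python
from collections import deque


class DelayedLengthScheduler:
    """Apply each predicted truncation length a fixed number of steps late.

    Hypothetical sketch: because the length used now was decided earlier,
    the current step never has to wait for a fresh scheduling decision.
    """

    def __init__(self, default_len: int, delay: int = 2) -> None:
        # Until real predictions arrive, fall back to a default length.
        self._pending = deque([default_len] * delay)

    def step(self, new_prediction: int) -> int:
        """Queue this step's prediction and return the length to use now."""
        self._pending.append(new_prediction)
        return self._pending.popleft()


scheduler = DelayedLengthScheduler(default_len=4)
for predicted in [6, 3, 5, 2]:
    used = scheduler.step(predicted)
    print(f"predicted {predicted}, used {used}")
```

The first two steps use the default, and from the third step on each step uses the prediction made two steps before it.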

In tests covering multiple domains such as mathematical reasoning, code generation, and daily conversation, DSpark significantly outperformed the current state-of-the-art autoregressive draft model (Eagle3) and parallel draft model (DFlash). For example, on Qwen3 series target models (4B, 8B, 14B), its average acceptance length improved by 26.7% to 30.9% over Eagle3 and by 16.3% to 18.4% over DFlash.

In deployment, compared with the previous-generation single-token production baseline (MTP-1) and at the same overall throughput, DSpark increased per-user generation speed by 60%-85% for the Flash model and by 57%-78% for the Pro model.

Released alongside DSpark is DeepSpec, a full-stack codebase for training and evaluating speculative decoding draft models. It is the "open-source infrastructure" that hosts this solution and other advanced algorithm implementations, containing data preparation tools, draft model implementations, training code, and evaluation scripts.

DeepSpec splits the overall workflow into three stages: data preparation, training, and evaluation. The three stages need to be run sequentially, with the output of the previous stage serving as input for the next.

In the data preparation stage, one needs to download prompt data, regenerate answers using an inference engine on the target model, and build the target cache. Notably, taking the default Qwen/Qwen3-4B configuration as an example, the target cache volume can reach about 38 TB, requiring thorough assessment of storage resources before use.
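A quick pre-flight check of free disk space can catch an undersized volume before data preparation starts. The sketch below is not part of DeepSpec: the function name and the cache path are hypothetical, and it assumes "38 TB" means decimal terabytes, which the source does not specify.

```python
import shutil

# About 38 TB for the default Qwen/Qwen3-4B configuration, per the article.
# Assumption: decimal terabytes (10**12 bytes).
TARGET_CACHE_BYTES = 38 * 10**12


def has_enough_space(path: str, required_bytes: int = TARGET_CACHE_BYTES) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


cache_dir = "."  # hypothetical location for the target cache
free_tb = shutil.disk_usage(cache_dir).free / 10**12
print(f"free: {free_tb:.1f} TB, enough for target cache: {has_enough_space(cache_dir)}")
```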

The training stage is launched via bash scripts/train/train.sh. This script calls train.py and launches one worker for each visible GPU. Users select an algorithm and target model configuration from the config/ directory by specifying config_path. Training settings can also be adjusted by overriding config_path and target_cache_dir, and by using --opts to modify individual configuration fields.

Regarding hardware, DeepSpec's default configuration and scripts are designed for a single-node 8-GPU environment. If there are fewer GPUs, users need to correspondingly reduce the number of visible GPUs in CUDA_VISIBLE_DEVICES.
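For a machine with fewer than eight GPUs, the visible-device list can be built programmatically before launching the script. This is a sketch under assumptions: the helper names are hypothetical, and it assumes scripts/train/train.sh honors a CUDA_VISIBLE_DEVICES value inherited from the environment rather than setting its own, which should be checked against the script itself.

```python
import os
import subprocess
from typing import Dict


def visible_devices(num_gpus: int) -> str:
    """Build a CUDA_VISIBLE_DEVICES value exposing GPUs 0..num_gpus-1."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ",".join(str(i) for i in range(num_gpus))


def training_env(num_gpus: int) -> Dict[str, str]:
    """Copy the current environment and restrict the visible GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices(num_gpus)
    return env


def launch_training(num_gpus: int) -> None:
    """Run the training script from the repository root (not called here)."""
    subprocess.run(
        ["bash", "scripts/train/train.sh"],
        env=training_env(num_gpus),
        check=True,
    )


print(visible_devices(4))
```

Calling launch_training(4) from the repository root would then start training with four visible GPUs, provided the assumption about the script holds.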

The evaluation stage is launched via bash scripts/eval/eval.sh. The evaluation script will use the trained draft model checkpoint to measure acceptance on multiple speculative decoding benchmark tasks. The project's currently listed evaluation datasets include GSM8K, MATH500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, Alpaca, and Arena-Hard-v2, covering different task types such as mathematical reasoning, code generation, dialogue ability, and comprehensive Q&A.

In terms of algorithms, DeepSpec currently has three built-in draft models: DSpark, DFlash, and Eagle3. For target model series, the project currently supports Qwen3 and Gemma.

The open-sourcing of DeepSpec integrates the engineering practices of speculative decoding, which were previously scattered across various research teams, into a reproducible and extensible standardized toolchain. For researchers and engineers hoping to accelerate inference for their own large models, this means they can directly train custom draft models on a mature framework, skipping a large amount of repeated infrastructure building work.

Reference Links:

https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

https://github.com/deepseek-ai/DeepSpec

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), authors: Ze Nan, Yang Wen

Related Questions

Q: What is the main innovation introduced in the DeepSeek V4 update mentioned in the article?

A: The main innovation is the speculative decoding framework DSpark, which features a Semi-Autoregressive Generation architecture and Confidence-Scheduled Verification.

Q: According to the article, what was the core motivation for developing DSpark?

A: The core motivation was to solve the latency and throughput bottlenecks faced by LLM inference in production environments, especially in high-concurrency scenarios.

Q: How does DSpark's Confidence-Scheduled Verification aim to improve efficiency compared to previous speculative decoding methods?

A: It uses a Confidence Head to estimate each token's survival probability and a hardware-aware scheduler to dynamically determine the verification length per request. This allocates computing power only to tokens with the highest expected payoff, avoiding wasted effort on tokens likely to be rejected.

Q: What performance improvement does DSpark achieve over the previous single-token production baseline (MTP-1) for the Flash model, according to the article?

A: DSpark improves user generation speed by 60% to 85% for the Flash model while maintaining the same overall throughput.

Q: What is DeepSpec, and what purpose does it serve according to the article?

A: DeepSpec is a full-stack open-source codebase released alongside DSpark. It provides infrastructure for training and evaluating speculative decoding draft models, with tools for data preparation, model implementation, training, and evaluation that streamline custom draft model development.
