China's Chips, the Hidden Intersection of DeepSeek and Kimi

Published by marsbit on 2026-04-22 · Updated 2026-04-22

Summary

China's AI startup Kimi has released its latest open-source model, K2.6, featuring significant improvements in coding and agent capabilities. This version enhances long-context processing, supports up to 4,000 lines of code generation, and improves API accuracy and multi-agent collaboration, allowing up to 300 sub-agents to work in parallel. Kimi also introduced architectural innovations like MuonClip, Kimi Linear, and Attention Residuals to optimize scaling and efficiency. A key development is the "Prefill-as-a-Service" (PrfaaS) approach, which decouples prefill and decode tasks across data centers using heterogeneous hardware, significantly reducing latency and cost per token. This strategy not only improves performance but also creates opportunities for integration with domestic AI chips, as reliance on Chinese hardware grows due to export restrictions. Both Kimi and DeepSeek are increasingly aligning with local semiconductor advancements, shaping a new ecosystem for AI development in China.

"K2.6 is our strongest code model to date," Kimi wrote on its official account.

On the evening of April 20th, Kimi officially launched K2.6, an open-source model with stronger programming and Agent capabilities, about one quarter after the release of K2.5.

There was also a side note: rumors suggested that DeepSeek V4 would be released this week. If everything proceeds as outsiders expect, this would be yet another time Kimi and DeepSeek have released in step. But at a more fundamental infrastructure level, there is an underlying thread: these two large-model startups are ultimately destined to step into the same river—advancing together with domestic chip startups.

Rewind to March 2026, when Yang Zhilin took the stage at NVIDIA's GTC conference to discuss Kimi's technical roadmap. He said, "Many of the commonly used technical standards today are essentially products from eight or nine years ago, gradually becoming a bottleneck for Scaling."

To address such issues, Kimi has contributed several things to the open-source community: the first large-scale application of the second-order optimizer MuonClip; the Kimi Linear architecture, which lets large models process long contexts more efficiently; and Attention Residuals, which optimizes the connections between deep neural network layers.

Kimi's Scaling Strategy

Yang Zhilin believes that Kimi's evolution logic can be summarized as the "merger" of Token efficiency, long context, and Agent clusters. The newly launched Kimi K2.6 can be understood as a new assignment submitted by Yang Zhilin along this Scaling path.

Kimi's official website has integrated K2.6

Code, Agent, and What Else?

As one of the most easily standardized capabilities, code is a must-win area for cutting-edge models.

From K2 to K2.5 to K2.6, Kimi has maintained an iteration rhythm of about one quarter on several open-source models. However, since this is a minor version number, it hints that Yang Zhilin may have more cards up his sleeve.

"K2.6 has significantly improved long-range coding capabilities, able to code uninterrupted for 13 hours in tests, writing or modifying over 4,000 lines of code," Kimi wrote in a promotional material. "On the Kimi Code Bench, Kimi's internal strict code evaluation benchmark covering various complex end-to-end tasks, K2.6's score improved by about 20% compared to K2.5."

It's worth noting that K2.5 was already a very capable model, topping the OpenRouter charts in February. A source close to Kimi shared a screenshot of co-founder Zhang Yutao's WeChat Moments at the time, saying, "He seemed very satisfied with this version."

K2.6's performance on general Agent, programming, and visual Agent benchmarks

For Agent frameworks like OpenClaw and Hermes, K2.6's core improvements focus on the accuracy of API calls and the stability of long-running operations—the former lowers the cost of executing a task, while the latter improves the efficiency of executing it.

In the K2.5 version launched in January, Kimi introduced the concept of "Agent clusters," breaking down a task into multiple sub-tasks and automatically assigning them to different specialized Agents for processing, thereby reducing task processing time and avoiding the risk of entire project failure under serial task flows.

Demonstration of Kimi K2.6's Agent cluster capability

In the new K2.6 version, this capability is further amplified, integrating and parallelizing breadth search with in-depth research, large-scale document analysis and long-form writing, and multi-format content generation, supporting up to 300 sub-Agents completing 4,000 collaborative steps in parallel.
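The cluster pattern described here—decompose a task, fan subtasks out to specialized sub-agents, and isolate per-agent failures instead of letting a serial pipeline collapse—can be sketched with a small coordinator. This is a hypothetical illustration: none of the names below come from Kimi's actual API, and the sub-agent call is simulated.

```python
import asyncio

async def run_sub_agent(agent_id: int, subtask: str) -> str:
    """Stand-in for a call to one specialized sub-agent (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate model latency
    return f"agent {agent_id} finished: {subtask}"

async def run_agent_cluster(task: str, num_agents: int = 4) -> list[str]:
    """Split a task into subtasks and run them on sub-agents in parallel.

    A failed subtask is isolated rather than failing the whole run,
    avoiding the serial-pipeline failure mode described in the article.
    """
    subtasks = [f"{task} / part {i}" for i in range(num_agents)]
    results = await asyncio.gather(
        *(run_sub_agent(i, st) for i, st in enumerate(subtasks)),
        return_exceptions=True,  # per-agent errors come back as values
    )
    return [r for r in results if not isinstance(r, Exception)]

results = asyncio.run(run_agent_cluster("analyze report", num_agents=4))
print(len(results))  # 4
```

Scaling this shape to hundreds of sub-agents is then mostly a scheduling and rate-limiting problem, which is where the per-failure isolation above matters most.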

To summarize the highlights of Kimi K2.6 in one sentence, they roughly include: evolution in code and long-range task capabilities, evolution of Agent cluster capabilities, and optimization for mainstream Agent frameworks.

If I had to pick a personal favorite from the above features, I believe the Agent cluster is the most valuable capability—it directly embodies the explosive power of parallel computing. Code quality and the stability of long-range tasks are things model iteration must address anyway. More importantly, these capability improvements in turn drive innovation in how Agents work, how efficiently they work, and even how users interact with them.

After all, as a user, what I want is not for it to tell me what it can do, but for it to drive Agents to solve my real problems and form effective productivity.

When K2.5 was launched, an academic researcher began using this model for scientific research projects. His evaluation at the time was that it had no weaknesses and could serve as a research assistant.

"The multi-Agent provided by the official is indeed effective; many domestic Agents last year were still toys."

If Kimi K2.5 received positive evaluations both internally and externally, how effective will K2.6, which goes a step further, be?

On the Artificial Analysis intelligence leaderboard, Kimi K2.6 ranks just behind three closed-source models and leads the open-weights leaderboard

The "New Story" in the Roadmap

Kimi occasionally brings something new to the industry, including the MuonClip, Kimi Linear, and Attention Residuals mentioned in Yang Zhilin's roadmap speech. Some of these explorations have also received recognition from top industry players.

In mid-March, Kimi published a paper on Attention Residuals, proposing the use of attention mechanisms to rework residual connections. Musk tweeted that this was "an impressive breakthrough by Kimi."

Last weekend, Kimi published a new paper titled "Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter" (PrfaaS), describing Kimi's latest architectural exploration; the core topic is still PD separation (Prefill and Decode).

PD separation is not a new topic—the Prefill stage of model inference is a computationally intensive task, while the Decode stage relies on memory bandwidth, with memory repeatedly reading and writing KV Cache. This architecture aims to decouple compute-intensive tasks from bandwidth-intensive tasks, improving compute utilization and throughput, thereby reducing costs and increasing efficiency.
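To make the two bottlenecks concrete, here is a toy sketch of the split (not Kimi's implementation; every name and all the arithmetic are illustrative stand-ins): prefill makes one batched pass over the whole prompt and emits the KV cache, while decode generates one token at a time and must re-read that growing cache on every step.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    """Per-layer key/value tensors produced by prefill (schematic)."""
    keys: list
    values: list

def prefill(prompt_tokens: list[int]) -> KVCache:
    # Compute-bound: one large batched pass over all prompt tokens.
    keys = [t * 2 for t in prompt_tokens]    # stand-in for K projections
    values = [t * 3 for t in prompt_tokens]  # stand-in for V projections
    return KVCache(keys, values)

def decode(cache: KVCache, max_new_tokens: int) -> list[int]:
    # Bandwidth-bound: every step re-reads the (growing) KV cache.
    out = []
    for _ in range(max_new_tokens):
        next_token = sum(cache.keys) % 100   # stand-in for attention
        out.append(next_token)
        cache.keys.append(next_token * 2)    # cache grows by one entry
        cache.values.append(next_token * 3)  # per generated token
    return out

cache = prefill([1, 2, 3])   # could run on a compute-heavy cluster
tokens = decode(cache, 4)    # could run on a bandwidth-heavy cluster
print(len(tokens))  # 4
```

The design question PD separation answers is visible in the two function bodies: `prefill` touches each prompt token once in bulk, while `decode` touches the entire cache once per output token, so the two stages want different hardware.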

PD separation works well, but there is a sticking point: it has to run over an RDMA high-speed network within a single data center.

The core point of Kimi's PrfaaS paper is: based on a hybrid model (Kimi Linear), it significantly reduces the KV cache size, and then completely decouples Prefill and Decode into different heterogeneous clusters.

The experimental example mentioned in the paper shows that the PrfaaS dedicated prefill cluster uses 32 H200 GPUs focused on high compute power; the local PD decoding cluster uses 64 H20 GPUs interconnected via RDMA internal network; the two clusters are connected via VPC dedicated line, with a total cross-cluster bandwidth of about 100Gbps. The test model is a 1T parameter Kimi Linear hybrid attention model.

Actual test results show that, compared to a same-data-center PD cluster of 96 H20 cards, the PrfaaS cross-data-center solution improves throughput by 54% and reduces P90 TTFT (the waiting time for 90% of users from sending a request to seeing the first character returned) from 9.73s to 3.51s, a reduction of 64%, while cross-data-center KV cache transmission occupies only 13% of the total 100Gbps bandwidth.
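As a quick sanity check, the quoted figures are internally consistent (pure arithmetic, using only the numbers reported above):

```python
# P90 TTFT reported in the paper: 9.73 s down to 3.51 s
ttft_reduction = (9.73 - 3.51) / 9.73
print(f"TTFT reduced by {ttft_reduction:.0%}")  # TTFT reduced by 64%

# KV-cache transfer uses 13% of the 100 Gbps cross-cluster link
kv_bandwidth_gbps = 0.13 * 100
print(f"KV transfer: {kv_bandwidth_gbps:.0f} Gbps of 100 Gbps")  # 13 Gbps
```

In other words, the dedicated line is far from saturated, which is what leaves headroom for the cross-data-center design.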

Comparison of KV throughput between hybrid architecture models and dense models under different context lengths

To demonstrate the advantages of the hybrid model architecture, the paper mentions a set of experiments: under an 8-card H200 and SGLang v0.5.9 inference framework, benchmark tests were conducted on several mainstream models. At a context length of 32K, the KV throughput of the MiMo-V2-Flash model using hybrid attention was only 4.66Gbps, while the similarly scaled dense attention model MiniMax-M2.5 reached 59.93Gbps, directly proving that the hybrid attention architecture can reduce KV cache transmission requirements to within the range manageable by ordinary Ethernet.
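The gap between those two figures can be put in the context of commodity links. The check below compares both reported KV throughputs against a 25 Gbps Ethernet link; the 25 Gbps figure is an assumption chosen for illustration, not a number from the paper.

```python
ethernet_gbps = 25.0   # ASSUMED commodity data-center Ethernet link speed

hybrid_kv_gbps = 4.66   # MiMo-V2-Flash, hybrid attention (reported)
dense_kv_gbps = 59.93   # MiniMax-M2.5, dense attention (reported)

print(f"hybrid fits on {ethernet_gbps} Gbps link: "
      f"{hybrid_kv_gbps <= ethernet_gbps}")   # True
print(f"dense fits on {ethernet_gbps} Gbps link: "
      f"{dense_kv_gbps <= ethernet_gbps}")    # False
print(f"dense needs {dense_kv_gbps / hybrid_kv_gbps:.1f}x "
      f"more KV bandwidth")                   # 12.9x
```

This is the article's point in miniature: hybrid attention shrinks KV transmission by an order of magnitude, moving it from RDMA territory into ordinary Ethernet territory.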

"Cross-data center + heterogeneous hardware unlocks the potential to significantly reduce per-token cost," Kimi said on its official account.

Regarding token cost reduction, I mentioned in the article "The People Miss DeepSeek" that there is room for optimization at both the model and hardware levels. Professor Hu Yanping of Shanghai University of Finance and Economics posted on WeChat Moments specifically to emphasize that cost reduction cannot rely on DeepSeek alone: "The solution depends on the cost efficiency of compute supply, generational improvements in model quality, continuous advancement of intelligence paradigms, and the amplification effect of workflow and scenario integration."

From this perspective, Kimi has told the industry a new story about token cost reduction.

Chinese Models Summon Chinese Chips

In the Prefill-as-a-Service paper, most readers noticed only the cross-data-center narrative, overlooking the point about heterogeneous hardware.

It is important to note that the H200 and H20 are both based on the Hopper architecture in terms of chip design. The heterogeneity mentioned in the paper refers to heterogeneity in bandwidth and compute power. The implication is: compute-strong domestic cards can be used for Prefill, or bandwidth-strong domestic cards for Decode, and of course they can also be mixed with overseas cards to cut costs and raise efficiency.

It can be said that this is a door opened by Kimi for Chinese chips in large model inference.

In the view of one domestic compute-power insider, to catch the wave of benefits brought by the Prefill-as-a-Service solution, domestic chips still have to face the old problem of ecosystem.

Over the past few years, Chinese large models have been kept out of domestic compute by ecosystem challenges, but there is another overlooked detail: supply of products like the H20 has been cut off for a year. In other words, in the short term, there is only one option for inference chips: domestic.

As inference demand surges, ecosystem challenges will become secondary compared to supply—the dependence of Chinese large models on domestic compute has shifted from optional to unavoidable. Because of this, many predictions hold that DeepSeek V4 is being adapted to domestic compute.

In my article with Professor Hu Yanping, "The Last Urging Letter to DeepSeek," we said that adapting to domestic compute power is a very difficult road for domestic models, but in the longer term, it has to be done. Something that must be done always needs a starting point, and perhaps DeepSeek V4 is that starting point.

Now, DeepSeek V4 has not yet arrived, but Kimi has already used its own practice to explore a feasible path for the combination of Chinese models + Chinese chips.

Kimi has taken the lead as a model representative in extending an olive branch; the problem now lies with domestic chip startups.

Does everyone remember Jensen Huang's reaction when asked about the chip export ban to China on the latest episode of the Dwarkesh Podcast? He said that chips are not enriched uranium, and a sales ban cannot stop the progress of Chinese chips; Chinese companies can still develop models by brute-force stacking domestic chips.

Why did Jensen Huang say this? The next step for DeepSeek and Kimi is the standard answer.

This article is from the WeChat public account "Tencent Technology," author: Su Yang, editor: Xu Qingyang

Related Questions

Q: What are the key improvements in Kimi's K2.6 model compared to K2.5?

A: Kimi's K2.6 model shows significant improvements in long-range coding capabilities, allowing uninterrupted coding for up to 13 hours and handling over 4,000 lines of code. It also enhances API call accuracy and long-running stability for Agent frameworks, and expands Agent cluster capabilities to support up to 300 sub-agents performing 4,000 collaborative steps in parallel. Overall, it achieves a roughly 20% improvement on Kimi's internal code benchmark compared to K2.5.

Q: How does Kimi's Prefill-as-a-Service (PrfaaS) architecture improve efficiency?

A: Kimi's PrfaaS architecture decouples the Prefill (compute-intensive) and Decode (memory bandwidth-intensive) stages of model inference across different heterogeneous clusters, even across data centers. By using a hybrid attention model (Kimi Linear) to reduce KV cache size, it allows the use of high-compute chips (e.g., H200) for Prefill and high-bandwidth chips (e.g., H20) for Decode, connected via VPC dedicated lines. This approach increases throughput by 54%, reduces latency (P90 TTFT) by 64%, and significantly lowers token costs.

Q: What role do Chinese chips play in the future of AI companies like Kimi and DeepSeek?

A: Chinese chips are becoming critical infrastructure for AI companies like Kimi and DeepSeek due to export restrictions on high-end GPUs like NVIDIA's H20. Kimi's PrfaaS architecture demonstrates how heterogeneous hardware—including domestic chips—can be used for Prefill and Decode tasks, offering a viable path for cost-efficient inference. As inference demand grows, reliance on domestic chips shifts from optional to necessary, pushing Chinese AI firms to adapt and collaborate with local chip startups.

Q: What did Elon Musk say about Kimi's research contribution?

A: Elon Musk tweeted that Kimi's work on Attention Residuals—a technique using attention mechanisms to rework residual connections—was "an impressive breakthrough by Kimi."

Q: How does Kimi's Agent cluster capability enhance productivity?

A: Kimi's Agent cluster capability breaks down complex tasks into sub-tasks distributed among specialized agents, enabling parallel processing. This reduces task failure rates and improves efficiency by avoiding serial bottlenecks. In K2.6, it integrates breadth-depth search, large-scale document analysis, long-form writing, and multi-format content generation, supporting up to 300 sub-agents and 4,000 collaborative steps, making it a concrete productivity tool for users.
