China's Chips, the Hidden Intersection of DeepSeek and Kimi

marsbit | Published 2026-04-22 | Updated 2026-04-22

Summary

China's AI startup Kimi has released its latest open-source model, K2.6, featuring significant improvements in coding and agent capabilities. This version enhances long-context processing, supports up to 4,000 lines of code generation, and improves API accuracy and multi-agent collaboration, allowing up to 300 sub-agents to work in parallel. Kimi also introduced architectural innovations like MuonClip, Kimi Linear, and Attention Residuals to optimize scaling and efficiency. A key development is the "Prefill-as-a-Service" (PrfaaS) approach, which decouples prefill and decode tasks across data centers using heterogeneous hardware, significantly reducing latency and cost per token. This strategy not only improves performance but also creates opportunities for integration with domestic AI chips, as reliance on Chinese hardware grows due to export restrictions. Both Kimi and DeepSeek are increasingly aligning with local semiconductor advancements, shaping a new ecosystem for AI development in China.

"K2.6 is our strongest code model to date," Kimi wrote on its official account.

On the evening of April 20th, Kimi officially launched the open-source model K2.6, which demonstrates stronger programming and Agent capabilities, about a quarter after the release of version K2.5.

There was also a minor episode: rumors suggested that DeepSeek V4 would be released this week. If everything proceeds as expected by the outside world, this would be the Nth time Kimi and DeepSeek have coincided. But at a more fundamental infrastructure level, there is an underlying thread: Kimi and DeepSeek, these two large model startups, are ultimately destined to step into the same river—advancing together with domestic chip startups.

Rewind to March 2026, when Yang Zhilin took the stage at NVIDIA's GTC conference to discuss Kimi's technical roadmap. He said, "Many of the commonly used technical standards today are essentially products from eight or nine years ago, gradually becoming a bottleneck for Scaling."

To address such issues, Kimi has contributed to the open-source community the first large-scale application of the second-order optimizer MuonClip, the Kimi Linear architecture that makes large models more efficient at processing long contexts, and Attention Residuals, which optimizes the connections between deep neural network layers.

Kimi's Scaling Strategy

Yang Zhilin believes that Kimi's evolution logic can be summarized as the "merger" of Token efficiency, long context, and Agent clusters. The newly launched Kimi K2.6 can be understood as a new assignment submitted by Yang Zhilin along this Scaling path.

Kimi's official website has integrated K2.6

Code, Agent, and What Else?

As one of the most easily standardized capabilities, code is a must-win area for cutting-edge models.

From K2 to K2.5 to K2.6, Kimi has maintained an iteration rhythm of about one quarter on several open-source models. However, since this is a minor version number, it hints that Yang Zhilin may have more cards up his sleeve.

"K2.6 has significantly improved long-range coding capabilities, able to code uninterrupted for 13 hours in tests, writing or modifying over 4,000 lines of code," Kimi wrote in a promotional material. "On the Kimi Code Bench, Kimi's internal strict code evaluation benchmark covering various complex end-to-end tasks, K2.6's score improved by about 20% compared to K2.5."

It's worth noting that K2.5 was already a very "capable model," topping the OpenRouter charts in February. A source close to Kimi shared a screenshot of co-founder Zhang Yutao's WeChat Moments post from that time, saying, "He seemed very satisfied with this version."

K2.6's performance on general Agent, programming, and visual Agent benchmarks

For Agent frameworks like OpenClaw and Hermes, K2.6's core improvements focus on the accuracy of API calls and the stability of long-running operations: one improves the cost of task execution, while the other improves its efficiency.

In the K2.5 version launched in January, Kimi introduced the concept of "Agent clusters," breaking down a task into multiple sub-tasks and automatically assigning them to different specialized Agents for processing, thereby reducing task processing time and avoiding the risk of entire project failure under serial task flows.

Demonstration of Kimi K2.6's Agent cluster capability

In the new K2.6 version, this capability is further amplified, integrating and parallelizing breadth search with in-depth research, large-scale document analysis and long-form writing, and multi-format content generation, supporting up to 300 sub-Agents completing 4,000 collaborative steps in parallel.
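The fan-out pattern described here can be sketched in a few lines. This is only an illustration of capped parallel sub-tasks, not Kimi's implementation: `run_sub_agent` and `run_cluster` are hypothetical names, the sub-agent body is a stand-in for a real model call, and the only figure taken from the article is the cap of 300.

```python
import asyncio

MAX_SUB_AGENTS = 300  # parallelism cap quoted in the article


async def run_sub_agent(task_id: int, limit: asyncio.Semaphore) -> str:
    # Hypothetical stand-in for a real sub-agent (e.g. a model API request).
    async with limit:  # at most MAX_SUB_AGENTS bodies run at once
        await asyncio.sleep(0)  # yield control; a real agent would do I/O here
        return f"result-{task_id}"


async def run_cluster(num_tasks: int) -> list:
    limit = asyncio.Semaphore(MAX_SUB_AGENTS)
    jobs = [run_sub_agent(i, limit) for i in range(num_tasks)]
    # return_exceptions=True keeps one failed sub-task from aborting the rest,
    # which is the failure-isolation benefit over a serial task flow.
    return await asyncio.gather(*jobs, return_exceptions=True)


results = asyncio.run(run_cluster(1000))
```

Because `asyncio.gather` returns results in submission order, the orchestrator can map each result back to its sub-task without extra bookkeeping.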

To summarize the highlights of Kimi K2.6 in one sentence, they roughly include: evolution in code and long-range task capabilities, evolution of Agent cluster capabilities, and optimization for mainstream Agent frameworks.

If I had to pick a personal preference from the above features, I believe the Agent cluster is the most valuable capability—it directly embodies the explosive power of parallel computing. Whether it's code or the stability of long-range tasks, these are things that model iteration must address. More importantly, based on these capability improvements, they drive innovation in Agent working methods, efficiency, and even interaction modes.

After all, as a user, what I want is not for it to tell me what it can do, but for it to drive Agents to solve my real problems and form effective productivity.

When K2.5 was launched, an academic researcher began using this model for scientific research projects. His evaluation at the time was that it had no weaknesses and could serve as a research assistant.

"The multi-Agent provided by the official is indeed effective; many domestic Agents last year were still toys."

If Kimi K2.5 received positive evaluations both internally and externally, how effective will K2.6, which goes a step further, be?

On the Artificial Analysis intelligence leaderboard, Kimi K2.6 ranks behind only three closed-source models and leads the open-weight model leaderboard

The "New Story" in the Roadmap

Kimi regularly brings something new to the industry, including the MuonClip, Kimi Linear, and Attention Residuals work mentioned in Yang Zhilin's roadmap speech. Some of these explorations have also been recognized by top industry players.

In mid-March, Kimi published a paper on Attention Residuals, proposing the use of attention mechanisms to rework residual connections. Musk directly tweeted that this was "an impressive breakthrough by Kimi."

Last weekend, Kimi published a new paper titled "Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter" (PrfaaS, Prefill-as-a-Service), describing Kimi's new exploration in architecture. Its core topic is still PD separation (the separation of Prefill and Decode).

PD separation is not a new topic—the Prefill stage of model inference is a computationally intensive task, while the Decode stage relies on memory bandwidth, with memory repeatedly reading and writing KV Cache. This architecture aims to decouple compute-intensive tasks from bandwidth-intensive tasks, improving compute utilization and throughput, thereby reducing costs and increasing efficiency.
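Why Prefill is compute-bound and Decode is bandwidth-bound can be seen from a rough arithmetic-intensity estimate. The sketch below is a simplification, not taken from the paper: it assumes about 2 FLOPs per parameter per token, 2-byte weights, and ignores KV cache and activation traffic. The function name and the 8,192-token prompt are illustrative assumptions.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one pass over the model.

    Assumes ~2 FLOPs per parameter per token; ignores KV cache and
    activation traffic, so this is an upper-level approximation only.
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / bytes_per_param


prefill = arithmetic_intensity(8192)  # hypothetical 8K-token prompt, processed at once
decode = arithmetic_intensity(1)      # one new token per step at batch size 1
```

Under these assumptions Prefill does thousands of FLOPs for every byte of weights it loads, while Decode does about one, which is why the two stages favor different hardware.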

Although PD separation is good, there is a sticking point: it must be based on RDMA high-speed networks within the same data center.

The core point of Kimi's PrfaaS paper is: based on a hybrid model (Kimi Linear), it significantly reduces the KV cache size, and then completely decouples Prefill and Decode into different heterogeneous clusters.

The experimental example mentioned in the paper shows that the PrfaaS dedicated prefill cluster uses 32 H200 GPUs focused on high compute power; the local PD decoding cluster uses 64 H20 GPUs interconnected via RDMA internal network; the two clusters are connected via VPC dedicated line, with a total cross-cluster bandwidth of about 100Gbps. The test model is a 1T parameter Kimi Linear hybrid attention model.

Test results show that the PrfaaS-PD cross-data center solution, compared with a homogeneous 96-card H20 PD cluster, improves throughput by 54% and reduces P90 TTFT (the time within which 90% of users see the first character returned after sending a request) from 9.73s to 3.51s, a reduction of 64%. Cross-data center KV cache transmission uses only 13% of the total 100Gbps bandwidth.
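The reported figures can be checked with simple arithmetic. The snippet below uses only numbers quoted in the article; the variable names are ours.

```python
ttft_before_s = 9.73   # P90 TTFT, 96-card H20 baseline
ttft_after_s = 3.51    # P90 TTFT, PrfaaS-PD setup
link_gbps = 100        # total cross-cluster bandwidth
kv_share = 0.13        # fraction of the link used by KV cache transfer

# Relative TTFT reduction, as a fraction of the baseline
ttft_reduction = (ttft_before_s - ttft_after_s) / ttft_before_s

# Absolute bandwidth consumed by cross-data center KV cache traffic
kv_traffic_gbps = link_gbps * kv_share

# The two setups use the same number of cards: 32 H200 + 64 H20 vs 96 H20
total_cards = 32 + 64

print(f"TTFT reduction: {ttft_reduction:.1%}")
print(f"KV traffic: {kv_traffic_gbps:.1f} Gbps")
```

The reduction works out to roughly 64%, matching the article, and the KV cache traffic comes to about 13 Gbps of the 100 Gbps line.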

Comparison of KV throughput between hybrid architecture models and dense models under different context lengths

To demonstrate the advantages of the hybrid model architecture, the paper mentions a set of experiments: under an 8-card H200 and SGLang v0.5.9 inference framework, benchmark tests were conducted on several mainstream models. At a context length of 32K, the KV throughput of the MiMo-V2-Flash model using hybrid attention was only 4.66Gbps, while the similarly scaled dense attention model MiniMax-M2.5 reached 59.93Gbps, directly proving that the hybrid attention architecture can reduce KV cache transmission requirements to within the range manageable by ordinary Ethernet.
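To put the two throughput figures side by side, the sketch below computes their ratio and how many such 8-card instances a single link could feed. The 100 Gbps link capacity is borrowed from the cross-cluster experiment described earlier and is an assumption here, as is ignoring protocol overhead and traffic bursts.

```python
hybrid_gbps = 4.66     # MiMo-V2-Flash at 32K context (from the article)
dense_gbps = 59.93     # MiniMax-M2.5 at 32K context (from the article)
link_gbps = 100.0      # assumed link capacity, ignoring overhead and bursts

# How much more KV traffic the dense model generates
ratio = dense_gbps / hybrid_gbps

# Number of 8-card prefill instances whose KV output fits on one link
hybrid_instances = int(link_gbps // hybrid_gbps)
dense_instances = int(link_gbps // dense_gbps)

print(f"dense/hybrid KV traffic ratio: {ratio:.1f}x")
print(f"instances per link: hybrid={hybrid_instances}, dense={dense_instances}")
```

Under these assumptions the dense model produces about 13 times the KV traffic, so one link that carries a single dense instance could carry roughly twenty hybrid ones.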

"Cross-data center + heterogeneous hardware unlocks the potential to significantly reduce per-token cost," Kimi said on its official account.

Regarding token cost reduction, I mentioned in the article "The People Miss DeepSeek" that there is room for optimization at both the model and hardware levels. Professor Hu Yanping from Shanghai University of Finance and Economics made a point of posting on WeChat Moments that cost reduction cannot rely on DeepSeek alone. "The solution depends on the cost efficiency of compute power supply, generational leaps in model quality, continuous advancement of intelligence paradigms, and the amplifying effect of workflow and scenario integration."

From this perspective, Kimi has told the industry a new story about token cost reduction.

Chinese Models Summon Chinese Chips

In the Prefill-as-a-Service paper, more people only noticed the cross-data center narrative, while overlooking the point about heterogeneous hardware.

It is important to note that the H200 and H20 are both still based on the Hopper architecture. The heterogeneity mentioned in the paper refers to heterogeneity in bandwidth and compute power. The takeaway is that some compute-strong domestic cards could be used for Prefill and bandwidth-strong domestic cards for Decode, and of course they could also be mixed with overseas cards to reduce cost and improve efficiency.
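The placement idea can be illustrated with a toy selection rule: send Prefill to the card with the most compute and Decode to the card with the most memory bandwidth. Everything in this sketch is hypothetical. The card names and specifications are invented placeholders, not real products, and a real scheduler would also weigh price, memory capacity, and software support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    compute_tflops: float  # hypothetical peak compute
    mem_bw_tbps: float     # hypothetical memory bandwidth in TB/s


# Placeholder cards with invented specs, for illustration only.
cards = [
    Card("card_a", compute_tflops=900.0, mem_bw_tbps=3.0),
    Card("card_b", compute_tflops=150.0, mem_bw_tbps=4.0),
    Card("card_c", compute_tflops=400.0, mem_bw_tbps=2.0),
]

# Prefill is compute-bound, so pick the card with the highest compute.
prefill_card = max(cards, key=lambda c: c.compute_tflops)

# Decode is bandwidth-bound, so pick the card with the highest memory bandwidth.
decode_card = max(cards, key=lambda c: c.mem_bw_tbps)

print(f"prefill -> {prefill_card.name}, decode -> {decode_card.name}")
```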

It can be said that this is a door opened by Kimi for Chinese chips in large model inference.

In the view of a domestic compute power insider, to catch this wave of traffic benefits brought by the Prefill-as-a-Service solution, they still have to face the old problem of ecosystem.

Over the past few years, Chinese large models have been kept away from domestic compute power by ecosystem challenges, but there is another overlooked detail: products like the H20 have been cut off from supply for a year. In other words, in the short term there is only one option for inference chips: domestic.

As inference demand surges relative to supply, ecosystem challenges will become a secondary issue: the dependence of Chinese large models on domestic compute power has changed from optional to mandatory. For this reason, many predictions hold that DeepSeek V4 is being adapted to domestic compute power.

In my article with Professor Hu Yanping, "The Last Urging Letter to DeepSeek," we said that adapting to domestic compute power is a very difficult road for domestic models, but in the longer term, it has to be done. Something that must be done always needs a starting point, and perhaps DeepSeek V4 is that starting point.

Now, DeepSeek V4 has not yet arrived, but Kimi has already used its own practice to explore a feasible path for the combination of Chinese models + Chinese chips.

Kimi has taken the lead as a model representative in extending an olive branch; the problem now lies with domestic chip startups.

Does everyone remember the reaction of Jensen Huang (Huang Renxun) when asked about the chip export ban on China in the latest episode of the "Dwarkesh Podcast"? He said that chips are not like uranium enrichment, and that a sales ban cannot stop the progress of Chinese chips; Chinese companies can still develop models by brute-force stacking domestic chips.

Why did Huang Renxun say this? The next step for DeepSeek and Kimi is the standard answer.

This article is from the WeChat public account "Tencent Technology," author: Su Yang, editor: Xu Qingyang

Related Questions

Q: What are the key improvements in Kimi's K2.6 model compared to K2.5?

A: Kimi's K2.6 model shows significant improvements in long-range coding capabilities; in tests it coded uninterrupted for 13 hours, writing or modifying over 4,000 lines of code. It also improves API call accuracy and long-running stability for Agent frameworks, and expands Agent cluster capabilities to support up to 300 sub-agents performing 4,000 collaborative steps in parallel. On Kimi's internal code benchmark, its score improved by about 20% compared to K2.5.

Q: How does Kimi's Prefill-as-a-Service (PrfaaS) architecture improve efficiency?

A: Kimi's PrfaaS architecture decouples the Prefill (compute-intensive) and Decode (memory bandwidth-intensive) stages of model inference across different heterogeneous clusters, even across data centers. By using a hybrid attention model (Kimi Linear) to reduce KV cache size, it allows the use of high-compute chips (e.g., H200) for Prefill and high-bandwidth chips (e.g., H20) for Decode, connected via VPC dedicated lines. In the paper's test, this approach increased throughput by 54% and reduced P90 TTFT by 64%, and Kimi says it unlocks the potential to significantly reduce per-token cost.

Q: What role do Chinese chips play in the future of AI companies like Kimi and DeepSeek?

A: Chinese chips are becoming critical infrastructure for AI companies like Kimi and DeepSeek due to export restrictions on GPUs like NVIDIA's H20. Kimi's PrfaaS architecture shows how heterogeneous hardware, including domestic chips, could be used for Prefill and Decode tasks, offering a viable path to cost-efficient inference. As inference demand grows, reliance on domestic chips shifts from optional to necessary, pushing Chinese AI firms to adapt and collaborate with local chip startups.

Q: What did Elon Musk say about Kimi's research contribution?

A: Elon Musk tweeted that Kimi's work on Attention Residuals, a technique using attention mechanisms to rework residual connections, was "an impressive breakthrough by Kimi."

Q: How does Kimi's Agent cluster capability enhance productivity?

A: Kimi's Agent cluster capability breaks down complex tasks into sub-tasks distributed among specialized agents, enabling parallel processing. This shortens task processing time and avoids the risk of an entire project failing under a serial task flow. In K2.6, it integrates broad search with in-depth research, large-scale document analysis, long-form writing, and multi-format content generation, supporting up to 300 sub-agents and 4,000 collaborative steps.
