In late May 2026, Deepseek internally formed a new Harness team, focused on a code agent product, internally benchmarking against Anthropic's Claude Code. Cui Tianyi, a former star quantitative engineer from Jane Street, joined the team in March, with senior researcher Chen Deli publicly confirming and leading the recruitment. Deepseek's job description clearly states a formula: 'Model + Harness = Agent'. As the capabilities of foundational large models gradually converge, the era of simply competing on parameters is fading. Deepseek's direct entry in building a toolchain team marks a shift in the main battlefield of China's AI competition from 'refining large models' to 'building toolchains and office productivity integration'.
Why is Deepseek Building Its Own Harness?
For a long time, developers' expectations for Deepseek focused on open-sourcing more powerful base models. But strong coding capability doesn't mean developers will adopt it as a productivity tool. What truly changes workflows isn't code answers in a chatbox, but engineering agents that can enter terminals, understand projects, read/write files, run commands, and fix bugs. Before the official move, the developer community had already built various open-source terminal Agents based on Deepseek models. By forming the Harness team now, Deepseek aims to control interface design and training data loop closure, integrating community-developed pathways into official core products.
To understand this strategic intent, one must first clarify what 'Harness' is. For non-technical readers, the term 'Harness' might be unfamiliar. In Deepseek's formula, the model handles reasoning, and the Harness handles everything else. 'Harness' originally means 'horse tack' or 'safety belt' in engineering, extended in the AI field to refer to the 'runtime infrastructure' of an Agent.
For a more accessible analogy, consider a large model as the 'brain' and 'intelligence' of a highly capable employee, while the Harness is that employee's 'job description, KPI evaluation criteria, office blast walls, and toolbox'. It's not a 'scaffolding' assembled before runtime, nor a 'framework' providing building blocks, but a continuously running system. It orchestrates execution loops, dispatches tool calls, manages context, performs security checks, and handles error recovery and state persistence. The large model itself is stateless and lacks environmental interaction capability—it can only receive text input and output text. The Harness compensates for these flaws, enabling the model to truly interact with the external world and execute specific tasks.
Why must foundational model companies master this runtime themselves? The core reason is that Agent products are not just outlets for model capabilities but also training grounds. Deepseek's JD emphasizes 'achieving co-evolution of the model and Harness'. In real-world complex tasks, models encounter various failures due to environmental constraints or tool exceptions. Recording these failure trajectories via the Harness can feed back into model training, creating a flywheel effect. If left to the community, model providers risk losing core application-layer data feedback, becoming mere compute and weight providers.
From an engineering perspective, optimizing the Harness is more critical to Agent success than merely optimizing prompts. According to technical experts, in Agent runtime, tool outputs constitute 67.6% of the content the Agent actually sees in its context, while system prompts account for only 3.4%. This means most of the model's 'view' is occupied by tool call results. If the Harness mishandles tool output formatting or fails to compress redundant information effectively, the model suffers from 'context rot', causing subsequent reasoning quality to plummet.
More critical is the compound error problem. An Agent process with 10 steps, each 99% reliable, has an end-to-end success rate of about 90%. When task complexity rises to 50 steps, the success rate plummets to around 60%. In real-world scenarios like codebase maintenance or enterprise office automation, continuous operations spanning dozens of steps are common. Here, even the strongest model reasoning cannot compensate for the cumulative probability loss. Only through error handling and recovery mechanisms within the Harness can retries or path corrections occur upon step failures. This is the engineering value of Harness and precisely why Deepseek must enter this arena directly.
Tencent Makes Connectors, Alibaba Makes Frontend Inroads: Big Tech's Divergent Toolchain Paths
Deepseek's shift is not an isolated case. According to industry media, strengthening Agent capabilities has become a key development direction for domestic foundational large models in 2026. Foundational models are gradually becoming 'utilities', shifting the competitive main battlefield to the application layer. Other domestic tech giants are also carving out differentiation through toolchains, but with distinct approaches, reflecting their respective ecosystem endowments and target user bases.
In June 2026, Tencent played its new card for enterprise Agents, launching WorkBuddy Enterprise Edition. Its core positioning is a full-scenario workplace intelligent agent desktop workbench, focusing on shifting from individual efficiency to organizational collaboration. WorkBuddy Enterprise Edition supports multi-agent parallelism and business system Connector integration, aiming to seize the unified AI office entry point. Tencent's positioning logic leverages its vast WeCom (Enterprise WeChat) and Tencent Cloud ecosystem. For large enterprises, the pain point in AI office automation isn't the ultimate experience of a single-point tool, but whether it can integrate with internal siloed office systems. By building connectors, Tencent enables Agents to directly orchestrate enterprise data and workflows, focusing on organization-level collaboration and complex task delivery. This path's strength lies in high barriers; once integrated into core business processes, switching costs are immense. The challenge is the need for robust enterprise service capabilities and customized support.
Alibaba has taken a different path, choosing to lower automation barriers on the web frontend. Alibaba open-sourced the purely frontend, in-browser GUI Agent framework, PageAgent. This framework requires no backend deployment; a single line of code allows any website to integrate AI operator capabilities. Alibaba's positioning logic is empowering web developers, instantly transforming any webpage into an AI-native application. Given the reality that many legacy enterprise systems lack API interfaces, achieving automation through frontend DOM manipulation is a pragmatic, disruptive path. This approach's advantage is its lightweight, easy integration nature, enabling rapid coverage of a vast long tail of websites. However, frequent changes to frontend DOM structures pose stability challenges, demanding higher error recovery capabilities from the Harness.
In contrast, companies are no longer solely competing on model benchmarks but building toolchains based on their unique ecosystem strengths. Tencent focuses on connectors, Alibaba on frontend penetration, while Deepseek starts with the most critical pain point for developers: code engineering scenarios. This divergence indicates that China's AI industry has recognized there is no perfect, universal Agent—only vertical solutions honed through robust Harness engineering for specific scenarios. For enterprise procurement, choosing a toolchain essentially means choosing an automation path: deep integration with an office ecosystem, flexible embedding into existing web systems, or empowering developer engineering workflows.
Viktor's $20M ARR Proof: Enterprises Will Pay for Autonomous Execution
The maturation of toolchains is changing the paradigm of AI's role in the office. The native Copilot logic is 'draft and wait for human completion'—AI generates copy or code, with the final step requiring human intervention for modification and execution. In this mode, AI is merely an efficiency tool, not a true labor replacement. Employees must constantly monitor AI output for verification and implementation, which actually increases cognitive load.
Overseas markets already show clear signals of a paradigm shift. As a reference point for global trends, Poland-based AI office automation company Viktor, positioned as an AI employee within Slack, achieved a $20 million Annual Recurring Revenue (ARR) without a sales team, serving 30,000 companies, and secured a $75 million Series A funding round in May 2026. Viktor's model represents the end state of new AI employees: possessing a cloud computer, capable of long-duration continuous operation, firmly grasping massive context, and delivering results directly.
Viktor is positioned as a Tier 3 AI Coworker, meaning it handles not simple Q&A but complex tasks like marketing audits, ad campaign management, lead research—requiring multi-step, long-running operations. Enterprises show strong willingness to pay for this type of AI that requires no final human confirmation and can operate continuously for long periods. The explosion of such commercial data proves the value anchor of office automation has shifted from 'assistive generation' to 'autonomous execution'.
Domestic manufacturers' focus on Harness and Agent toolchains aims to capture this trend. When the Harness provides sufficient safety rails, state persistence, and error recovery capabilities, AI can evolve from an 'intern' requiring constant human supervision to an 'outsourcing partner' capable of independently delivering work outcomes. Enterprise procurement focus will shift from model parameter size to whether the Agent can run stably for 8 hours without crashing, automatically handle API rate limits, and adapt to webpage structure changes. For developers, this means the focus of building AI applications shifts from 'how to write good prompts' to 'how to design a robust runtime environment'.
Token Explosion and the Engineering Barriers of 'Thick Frameworks'
As competition shifts to toolchains, the challenges faced by enterprises and developers in practical implementation haven't decreased but have become more focused on the engineering layer.
First and foremost is the Token explosion problem. Agents running for extended durations, in their 'think, act, feedback' loops, are prone to rapidly inflating context due to redundant tool outputs. This is widely discussed in developer communities, as it not only drives up inference costs but also causes model attention to scatter, drastically increasing task failure rates. For example, in a web scraping task, if the Harness feeds the entire webpage's HTML source code unchanged into the context, the model quickly gets lost in redundant information, forgetting the original task objective. Therefore, the Harness's context compression and memory management capabilities become a core consideration for enterprise procurement. A superior Harness must know which historical information can be discarded and which tool return results need summarization. This tests deep engineering architectural capabilities, not the model's inherent intelligence.
This also heightens developer wariness towards 'thin-shell' frameworks. If the Harness launched by a large model provider is merely a simple API wrapper offering basic chat windows and tool-calling interfaces, it will lack practical debugging value. The fragility of production environments demands Harness features like sandbox isolation, fine-grained permission control, and checkpoint/restart—characteristics of a 'thick framework'. Only a runtime with solid engineering barriers can truly meet the stability needs of enterprise-grade applications. For instance, in code execution scenarios, the Harness must provide a safe sandbox environment to prevent malicious code generated by the model from harming the host system. For long-running tasks, it must support checkpoint/restart to avoid restarting entire tasks due to network fluctuations.
Furthermore, geopolitical factors create a significant market vacuum for domestic Harness solutions. Top overseas engineering agent products like Claude Code restrict access for mainland China and Chinese-affiliated enterprises. Unable to use these top tools directly, domestic developers can only seek domestic alternatives. Deepseek forming its Harness team is not just following a technical trend but also responding to this vast replacement demand.
For enterprises and developers, understanding the value of Harness means when selecting AI products, they won't be dazzled by flashy demo conversations but will instead probe into its error recovery mechanisms, context management strategies, and whether it can truly integrate into existing workflows. In the toolchain competition stage, enterprises should prioritize evaluating vendors' engineering delivery capabilities and ecosystem compatibility over simply comparing model benchmarks. Developers should focus on the Harness framework's openness and the completeness of its debugging toolchain, choosing platforms that offer deeply controllable runtimes.







