First Long-Horizon Doc2Repo Training Dataset: Code Agents Move Beyond Bug Fixing and Begin Creating Repositories

marsbit | Published 2026-06-25 | Last updated 2026-06-25

Abstract

As LLM Code Agents advance, research focus is shifting toward long-horizon, real-world tasks, moving beyond simple bug fixes to full repository generation. To address this, researchers from Renmin University of China introduced the DeNovoSWE dataset. It targets long-horizon software engineering tasks, specifically the "document-to-repository" challenge: generating an entire, executable code repository from a task document. DeNovoSWE is built with a Divide & Conquer approach: target repositories are broken down into core capabilities, and a multi-agent Draft-Critic-Repair workflow automatically generates high-quality, evaluation-aligned task documents. Difficulty-aware filtering balances quality and diversity. The result is a high-quality, anti-leakage dataset of 4,818 instances. Experiments show that models trained on DeNovoSWE achieve significant improvements in long-horizon repository generation. For instance, Qwen3-30B-A3B-Instruct's performance on the BeyondSWE-Doc2Repo benchmark increased from 5.8% to 47.2%, and on NL2RepoBench from 4.3% to 23.0%. Gains were also observed on a stronger backbone, suggesting that dedicated long-horizon training data is crucial for advancing Code Agents from maintainers to architects capable of planning and building complete software projects from scratch.

With the continuous improvement of LLM Code Agent capabilities, more and more researchers are realizing it's time to advance to the next stage of long-horizon tasks that are closer to real-world scenarios. Consequently, some benchmarks for evaluating long-horizon tasks have emerged, such as NL2RepoBench and BeyondSWE. The expected role of Code Agents is gradually shifting from repository maintainers to architects, capable of planning and completing long-horizon coding tasks for entire repositories.

Recently, the Gaoling School of Artificial Intelligence at Renmin University of China completed related research and officially released the DeNovoSWE dataset, focusing on long-horizon software engineering tasks, particularly repository-level code generation from scratch.

Paper link: https://arxiv.org/pdf/2606.10728

Repository link: https://github.com/AweAI-Team/DeNovoSWE

Data link: https://huggingface.co/collections/AweAI-Team/denovoswe

Using the Divide & Conquer and Critic & Repair mechanisms, the team scaled up the construction of high-quality long-horizon SWE tasks. The result is DeNovoSWE, an open-source long-horizon SWE dataset containing 4,818 real-world instances. It provides large-scale data for training Code Agents' long-horizon capabilities and significantly improves their performance on such tasks.

The paper also proposes filtering based on a difficulty score, which eases the trade-off between keeping a sufficient proportion of difficult problems and maintaining trajectory quality.

Experiments show that the Qwen3-30B-A3B-Instruct model trained on DeNovoSWE improved from 5.8% to 47.2% on BeyondSWE-Doc2Repo and from 4.3% to 23.0% on NL2RepoBench, demonstrating the significant boost in repository-level code generation capabilities brought by long-horizon data.

Rebuilding an Entire Repository from a Single Document

Over the past year, with the scaling of large-scale SWE data in works like Scale-SWE, code agents have rapidly progressed on real software engineering tasks like SWE-bench. But as models become increasingly adept at "fixing an issue" or "changing a few lines of buggy code," a more critical question arises: Do agents truly possess long-horizon software engineering capabilities? Judging from the performance of frontier models on BeyondSWE-Doc2Repo and NL2RepoBench, the results are not ideal.

Real-world software development often isn't about modifying a single function or adding a conditional statement. It involves understanding requirements, planning architecture, creating files, designing APIs, handling dependencies, connecting modules, and ultimately making the entire repository run successfully in tests.

In other words, the real challenge lies in long-horizon repository-level generation: starting from a task document and generating a complete, executable, and verifiable software repository. This is precisely the problem DeNovoSWE aims to solve.

High-Quality "Generate from Scratch" Task Documents

In document-to-repository generation, the document is not just a README, nor a simple API list. It is essentially the sole task entry point for the agent to rebuild the entire repository.

A high-quality task document needs to meet at least two core standards.

First, it must be well-organized.

Repository-level tasks are inherently complex, involving multiple modules, interfaces, configurations, data structures, and interaction flows. If the document merely piles up function descriptions, the agent can easily get lost in fragmented information. Therefore, the document should first provide a clear overview of the repository, then divide into chapters based on capabilities or workflows, ensuring each part corresponds to a clear functional boundary.

Second, it must be written from the perspective of reliable evaluation.

The document cannot be too sparse, or the task becomes underspecified and the model is left guessing at what the evaluation expects. Nor can it be too detailed, since that would directly leak implementation details and remove the challenge.

A truly high-quality document should describe the key behaviors on which evaluation depends: including import paths, public APIs, inputs and outputs, default parameters, exception behaviors, configuration items, pattern strings, return fields, etc., while also outlining the general functionalities to be implemented. In other words, the document should be sufficient for the agent to reproduce testable behaviors, but it should not become a copy of the implementation code.

This is also the core idea of DeNovoSWE: making documents readable, implementable, and verifiable.

The DeNovoSWE Method

DeNovoSWE frames "generating a complete repository from a document" as a large-scale, verifiable long-horizon software engineering task. It does not rely on manually written documents but automatically constructs high-quality instances through a sandboxed multi-agent workflow. The entire method can be summarized in two steps: Divide and Conquer.

In the Divide stage, the system first analyzes the target repository, decomposing it into multiple repository capabilities.

Each capability corresponds to a core function or workflow within the repository, such as authentication and connection, data reading/writing, batch processing, export flows, etc. This way, the originally massive repository generation problem is split into several structurally clear document chapters.

Simultaneously, DeNovoSWE runs the original unit tests and collects execution traces to identify which functions, classes, and interfaces actually impact evaluation. This further distinguishes between direct components, core indirect components, and non-core indirect components: interfaces directly called by tests must be documented in detail; core indirect components that affect observable behaviors also need coverage; while non-core internal implementations can be left for the agent to handle freely.
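The three-way split described above can be illustrated with a small sketch. It assumes the execution traces have already been reduced to a call graph, and that the set of components with test-observable effects is given as input; how DeNovoSWE actually derives either is not stated in the article. All component names are hypothetical.

```python
from collections import deque

def classify_components(test_calls, call_graph, observable):
    """Split traced components into the three groups named in the text.

    test_calls: components called directly by the unit tests.
    call_graph: mapping of caller -> set of callees, taken from execution traces.
    observable: components assumed to affect test-observable behavior
                (an input here; the real criterion is not given in the article).
    """
    direct = set(test_calls)
    reached = set()
    queue = deque(direct)
    # Breadth-first walk over everything the tests reach indirectly.
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in direct and callee not in reached:
                reached.add(callee)
                queue.append(callee)
    core = reached & set(observable)
    return {
        "direct": direct,                      # document in detail
        "core_indirect": core,                 # also needs coverage
        "non_core_indirect": reached - core,   # left to the agent
    }

# Hypothetical trace of a small export library.
CALL_GRAPH = {
    "api.export": {"core.serialize", "utils._log"},
    "core.serialize": {"utils._escape"},
}
groups = classify_components({"api.export"}, CALL_GRAPH, {"core.serialize"})
for name, members in groups.items():
    print(name, sorted(members))
```

In this toy trace, only `api.export` would be documented in full, `core.serialize` would be covered because its output is visible to the tests, and the two `utils` helpers would be left to the agent.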

In the Conquer stage, DeNovoSWE uses a Draft-Critic-Repair mechanism to generate documents for each capability one by one. The Draft agent writes an initial draft; the Critic agent checks for omissions of key APIs, behavioral contracts, or structural information; the Repair agent then fixes the document based on the feedback. This cycle iterates until each capability chapter is clear, complete, and aligned with evaluation.
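As a rough illustration of the Draft-Critic-Repair control flow, the loop can be sketched with three plain callables standing in for the agents. The function names, the `max_rounds` cap, and the toy "required API" check are all assumptions made for this example; in DeNovoSWE the three roles are LLM agents running in a sandbox.

```python
def draft_critic_repair(capability, draft, critic, repair, max_rounds=3):
    """Iterate draft -> critic -> repair until the critic reports no issues.

    max_rounds is an assumed safety cap; the article only says the cycle
    repeats until the chapter is clear, complete, and evaluation-aligned.
    """
    doc = draft(capability)
    for _ in range(max_rounds):
        issues = critic(capability, doc)
        if not issues:
            break
        doc = repair(doc, issues)
    return doc

# Toy stand-ins for the three agents (hypothetical data).
REQUIRED_APIS = {"export": ["export_csv", "export_json"]}

def toy_draft(capability):
    # The first draft forgets one of the public APIs.
    return f"Capability: {capability}\n- export_csv(rows, path)"

def toy_critic(capability, doc):
    # Report every required API name that the document never mentions.
    return [name for name in REQUIRED_APIS[capability] if name not in doc]

def toy_repair(doc, issues):
    # Add an entry for each API the critic flagged as missing.
    return doc + "".join(f"\n- {name}(rows, path)" for name in issues)

chapter = draft_critic_repair("export", toy_draft, toy_critic, toy_repair)
print(chapter)
```

Passing the agents in as callables keeps the loop itself independent of any particular model or prompt, which is the only part of the mechanism the article describes.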

Finally, the documents for different capabilities are merged into a single, comprehensive task document, serving as the sole basis for the agent to generate the repository from scratch.

Difficulty: Why is This a Long-Horizon Task?

The difficulty of DeNovoSWE tasks stems from a fundamental change: it's no longer issue-level fixing, but whole-repository generation.

In traditional SWE tasks, agents typically face an existing repository, needing only to locate a bug, modify local code, and pass tests.

In DeNovoSWE, the agent faces a cleaned environment: the original source code and tests are removed, git history is reset, and potential leakage channels like caches, site-packages residues, pip wheels, temporary compilation artifacts, etc., are also cleared. This means the agent must truly rely on the document to complete the entire repository rebuild. It needs to plan the project structure, create module files, define public interfaces, implement cross-file interactions, handle dependencies and configurations, and continuously fix errors across multiple rounds of editing and test feedback.
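The cleaning step can be pictured with a minimal sketch that scrubs a workspace before the agent starts. The directory names and suffixes below are assumptions chosen for illustration, and deleting `.git` stands in for resetting history; the article does not give the exact list DeNovoSWE uses. The demo runs entirely inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical leak patterns; the real list is not given in the article.
LEAK_DIR_NAMES = {".git", "__pycache__", ".pytest_cache", "build", "dist"}
LEAK_FILE_SUFFIXES = {".pyc", ".whl"}

def scrub_workspace(root, source_paths):
    """Remove original sources/tests plus common leakage channels under root."""
    removed = []
    # 1. Drop the original implementation and tests named by the caller.
    for rel in source_paths:
        target = root / rel
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(rel)
        elif target.exists():
            target.unlink()
            removed.append(rel)
    # 2. Sweep what is left, shallowest paths first, for caches and artifacts.
    for path in sorted(root.rglob("*"), key=lambda p: len(p.parts)):
        if not path.exists():  # already removed together with a parent
            continue
        rel = path.relative_to(root).as_posix()
        if path.is_dir() and path.name in LEAK_DIR_NAMES:
            shutil.rmtree(path)
            removed.append(rel)
        elif path.is_file() and path.suffix in LEAK_FILE_SUFFIXES:
            path.unlink()
            removed.append(rel)
    return sorted(removed)

def demo():
    """Build a throwaway repo, scrub it, and return what the agent would see."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in ["pkg/core.py", "pkg/__pycache__/core.cpython-311.pyc",
                    "tests/test_core.py", ".git/HEAD", "TASK.md"]:
            file = root / rel
            file.parent.mkdir(parents=True, exist_ok=True)
            file.write_text("placeholder")
        scrub_workspace(root, ["pkg", "tests"])
        return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

remaining = demo()
print(remaining)
```

After the scrub, only the task document is left in the toy workspace, which mirrors the article's point that the document is the agent's sole starting material.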

Any deviation in an API signature, return field, exception type, or default behavior can cause test failures. Errors can also accumulate over the long horizon: an early poorly designed module can affect multiple subsequent files and call chains.

To further address the difficulty variance across different repositories, DeNovoSWE also proposes difficulty-aware trajectory filtering. In simple terms, easy tasks should require a higher pass rate, while difficult tasks should not be entirely discarded just for failing to achieve a perfect score. DeNovoSWE sets different filtering thresholds for different difficulty intervals based on structural complexity and LLM difficulty assessment, thereby balancing quality and diversity.
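A minimal sketch of such a filter is shown below. It assumes each trajectory already carries a single difficulty score in [0, 1] (the article says the score draws on structural complexity and an LLM assessment, but not how they are combined), and the interval boundaries and pass-rate thresholds are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    repo: str
    difficulty: float  # assumed combined score in [0, 1]
    pass_rate: float   # fraction of evaluation tests passed

# (upper bound of difficulty interval, minimum pass rate to keep).
# These numbers are hypothetical, not the paper's thresholds.
THRESHOLDS = [(0.33, 1.0), (0.66, 0.8), (1.0, 0.5)]

def min_pass_rate(difficulty):
    """Return the pass rate required for a task of the given difficulty."""
    for upper, required in THRESHOLDS:
        if difficulty <= upper:
            return required
    raise ValueError(f"difficulty out of range: {difficulty}")

def filter_trajectories(trajectories):
    """Keep trajectories that clear the bar for their own difficulty interval."""
    return [t for t in trajectories
            if t.pass_rate >= min_pass_rate(t.difficulty)]

candidates = [
    Trajectory("tiny-cli", difficulty=0.2, pass_rate=0.9),      # easy, imperfect
    Trajectory("tiny-cli", difficulty=0.2, pass_rate=1.0),      # easy, perfect
    Trajectory("orm-toolkit", difficulty=0.9, pass_rate=0.6),   # hard, partial
    Trajectory("orm-toolkit", difficulty=0.9, pass_rate=0.3),   # hard, too weak
]
kept = filter_trajectories(candidates)
for t in kept:
    print(t.repo, t.difficulty, t.pass_rate)
```

With a single global threshold of 1.0 the hard repository would vanish from the training set; with a single threshold of 0.5 the imperfect easy trajectory would slip in. Per-interval thresholds avoid both outcomes.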

This is particularly important for long-horizon tasks: the more complex the repository, the harder it is to pass all tests in one go. Yet trajectories from challenging repositories, even those with low scores or only partial success, still demonstrate valuable long-horizon planning and implementation behavior.

Experimental Results

The DeNovoSWE pipeline ultimately produced 4,818 high-quality document-to-repository task instances. Together they form an executable, evaluable, and trainable long-horizon software engineering environment.

Experimental results show that DeNovoSWE brings significant improvements to models' long-horizon repository generation capabilities. For Qwen3-30B-A3B-Instruct, the original model scored only 5.8% on BeyondSWE-Doc2Repo and 4.3% on NL2RepoBench. Training with conventional issue-level SWE data like Scale-SWE-Agent improved these to 29.2% and 18.3%, indicating that general SWE data does have transfer effects. However, when the model was trained using DeNovoSWE, performance further increased to 47.2% and 23.0%.

This demonstrates that data oriented towards "fixing bugs" cannot fully replace data oriented towards "generating complete repositories" for long-horizon tasks. To truly teach agents repository-level engineering, specialized training environments built for long-horizon tasks are needed.

On the stronger Qwen3.5-35B-A3B backbone, DeNovoSWE similarly brought stable gains: BeyondSWE-Doc2Repo improved from 43.8% to 50.0%, and NL2RepoBench from 23.5% to 27.1%. This further indicates that the benefits of DeNovoSWE are not due to accidental adaptation to a specific model, but stem from the high-quality long-horizon data itself.
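To make the comparison easier to read, the reported scores can be turned into percentage-point gains. The snippet below only restates the figures quoted in this article; the dictionary layout and the "start" label for each quoted starting score are choices made for this example.

```python
# Scores (%) as quoted in the article.
RESULTS = {
    ("Qwen3-30B-A3B-Instruct", "BeyondSWE-Doc2Repo"):
        {"start": 5.8, "Scale-SWE-Agent": 29.2, "DeNovoSWE": 47.2},
    ("Qwen3-30B-A3B-Instruct", "NL2RepoBench"):
        {"start": 4.3, "Scale-SWE-Agent": 18.3, "DeNovoSWE": 23.0},
    ("Qwen3.5-35B-A3B", "BeyondSWE-Doc2Repo"):
        {"start": 43.8, "DeNovoSWE": 50.0},
    ("Qwen3.5-35B-A3B", "NL2RepoBench"):
        {"start": 23.5, "DeNovoSWE": 27.1},
}

def gain(scores, before, after):
    """Percentage-point difference between two reported scores."""
    return round(scores[after] - scores[before], 1)

for (model, bench), scores in RESULTS.items():
    line = f"{model} on {bench}: +{gain(scores, 'start', 'DeNovoSWE')} pts vs. start"
    if "Scale-SWE-Agent" in scores:
        extra = gain(scores, "Scale-SWE-Agent", "DeNovoSWE")
        line += f", +{extra} pts vs. Scale-SWE-Agent"
    print(line)

qwen3_doc2repo = RESULTS[("Qwen3-30B-A3B-Instruct", "BeyondSWE-Doc2Repo")]
qwen35_doc2repo = RESULTS[("Qwen3.5-35B-A3B", "BeyondSWE-Doc2Repo")]
```

Read this way, the gap between DeNovoSWE and issue-level training data is larger on BeyondSWE-Doc2Repo than on NL2RepoBench, and the gains on the stronger backbone are smaller in absolute terms because its starting scores are already much higher.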

Conclusion

The next stage for code agents is not just about fixing individual issues faster, but about understanding documents, planning architecture, organizing modules, implementing interfaces, and ultimately generating a complete, runnable software repository.

DeNovoSWE systematically turns this goal into a trainable, verifiable, and scalable dataset. It answers a key question: What kind of data can truly train agents with long-horizon software engineering capabilities?

The answer is not more fragmented code, nor simpler problems, but high-quality, structured, evaluation-aligned, anti-leakage, full-repository generation tasks.

Starting from a single document, rebuild the entire repository. This is the threshold that long-horizon code agents need to cross.

Reference: https://arxiv.org/pdf/2606.10728

This article is from the WeChat public account "AI Era," edited by LRST.

Related Questions

Q: What is the main contribution of the research from Renmin University of China's Gaoling School of Artificial Intelligence?

A: The research introduces and releases the DeNovoSWE dataset, which is the first long-horizon Doc2Repo training set focused on repository-level code generation from scratch in software engineering tasks.

Q: How does the DeNovoSWE dataset address the challenge of long-horizon software engineering tasks?

A: It uses a sandboxed multi-agent workflow based on 'Divide & Conquer' and 'Draft-Critic-Repair' mechanisms to automatically construct high-quality task documents, ensuring they are well-organized and evaluation-aligned for whole-repository generation.

Q: What were the performance improvements observed after training a model on the DeNovoSWE dataset?

A: The Qwen3-30B-A3B-Instruct model showed significant improvement, increasing its performance on BeyondSWE-Doc2Repo from 5.8% to 47.2% and on NL2RepoBench from 4.3% to 23.0%.

Q: What are the two core standards mentioned for a high-quality task document in document-to-repository generation?

A: First, the document must be well-organized, providing a clear overview and structured chapters. Second, it must be evaluation-aligned, describing key behaviors for verification without leaking implementation details.

Q: What is the purpose of the difficulty-aware trajectory filtering mechanism in DeNovoSWE?

A: It sets different filtering thresholds for tasks of varying difficulty levels to balance quality and diversity, ensuring that valuable but partially successful trajectories from complex repositories are not discarded.
