First Long-Horizon Doc2Repo Training Dataset: Code Agents Move Beyond Bug Fixing and Begin Creating Repositories

marsbit · Published 2026-06-25 · Last updated 2026-06-25

Introduction

As LLM Code Agents advance, research focus is shifting toward long-horizon, real-world tasks, moving beyond simple bug fixes to full repository generation. To address this, researchers from Renmin University of China introduced the DeNovoSWE dataset, which targets long-horizon software engineering tasks, specifically the "document-to-repository" challenge: generating an entire, executable code repository from a task document. DeNovoSWE is built with a Divide & Conquer approach. It decomposes target repositories into core capabilities and uses a multi-agent Draft-Critic-Repair workflow to automatically generate high-quality, evaluation-aligned task documents. Difficulty-aware filtering then balances quality and diversity. The result is a high-quality, anti-leakage dataset of 4,818 instances. Experiments show that models trained on DeNovoSWE improve substantially on long-horizon repository generation. For instance, Qwen3-30B-A3B-Instruct's score on the BeyondSWE-Doc2Repo benchmark rose from 5.8% to 47.2%, and on NL2RepoBench from 4.3% to 23.0%. Stronger backbones showed similar gains, suggesting that dedicated long-horizon training data is crucial for advancing Code Agents from maintainers to architects capable of planning and building complete software projects from scratch.

With the continuous improvement of LLM Code Agent capabilities, more and more researchers are realizing it's time to advance to the next stage of long-horizon tasks that are closer to real-world scenarios. Consequently, some benchmarks for evaluating long-horizon tasks have emerged, such as NL2RepoBench and BeyondSWE. The expected role of Code Agents is gradually shifting from repository maintainers to architects, capable of planning and completing long-horizon coding tasks for entire repositories.

Recently, the Gaoling School of Artificial Intelligence at Renmin University of China completed related research and officially released the DeNovoSWE dataset, focusing on long-horizon software engineering tasks, particularly repository-level code generation from scratch.

Paper link: https://arxiv.org/pdf/2606.10728

Repository link: https://github.com/AweAI-Team/DeNovoSWE

Data link: https://huggingface.co/collections/AweAI-Team/denovoswe

Using Divide & Conquer decomposition and a Draft-Critic-Repair workflow, the team scaled up the construction of high-quality long-horizon SWE tasks. The result is DeNovoSWE, an open-source long-horizon SWE task dataset containing 4,818 real-world instances. It provides large-scale data for training Code Agents' long-horizon capabilities and significantly improves their performance on such tasks.

The paper also proposes trajectory filtering based on difficulty scores, which eases the trade-off between keeping a high proportion of difficult problems and maintaining trajectory quality.

Experiments show that the Qwen3-30B-A3B-Instruct model trained on DeNovoSWE improved from 5.8% to 47.2% on BeyondSWE-Doc2Repo and from 4.3% to 23.0% on NL2RepoBench, demonstrating the significant boost in repository-level code generation capabilities brought by long-horizon data.

Rebuilding an Entire Repository from a Single Document

Over the past year, as SWE training data has scaled up in works like Scale-SWE, code agents have progressed rapidly on real software engineering tasks such as SWE-bench. But as models become increasingly adept at "fixing an issue" or "changing a few lines of buggy code," a more critical question arises: do agents truly possess long-horizon software engineering capabilities? Judging from how frontier models perform on BeyondSWE-Doc2Repo and NL2RepoBench, not yet.

Real-world software development often isn't about modifying a single function or adding a conditional statement. It involves understanding requirements, planning architecture, creating files, designing APIs, handling dependencies, connecting modules, and ultimately making the entire repository run successfully in tests.

In other words, the real challenge lies in long-horizon repository-level generation: starting from a task document and generating a complete, executable, and verifiable software repository. This is precisely the problem DeNovoSWE aims to solve.

High-Quality "Generate from Scratch" Task Documents

In document-to-repository generation, the document is not just a README, nor a simple API list. It is essentially the sole task entry point for the agent to rebuild the entire repository.

A high-quality task document needs to meet at least two core standards.

First, it must be well-organized.

Repository-level tasks are inherently complex, involving multiple modules, interfaces, configurations, data structures, and interaction flows. If the document merely piles up function descriptions, the agent can easily get lost in fragmented information. Therefore, the document should first provide a clear overview of the repository, then divide into chapters based on capabilities or workflows, ensuring each part corresponds to a clear functional boundary.

Second, it must be written from the perspective of reliable evaluation.

The document cannot be too sparse; otherwise the task becomes underspecified, and the model may have to guess blindly to pass evaluation. Nor can it be too detailed, as that would leak implementation details and remove the challenge.

A truly high-quality document should describe the key behaviors on which evaluation depends: including import paths, public APIs, inputs and outputs, default parameters, exception behaviors, configuration items, pattern strings, return fields, etc., while also outlining the general functionalities to be implemented. In other words, the document should be sufficient for the agent to reproduce testable behaviors, but it should not become a copy of the implementation code.

This is also the core idea of DeNovoSWE: making documents readable, implementable, and verifiable.
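As a rough illustration of the evaluation-relevant behaviors listed above, a single documented API could be recorded as a small contract and checked for gaps. This is only a sketch: the `ApiContract` class, its field names, the `underspecified_fields` helper, and the `pkg.io.load_table` example are all hypothetical and are not taken from the DeNovoSWE paper or repository.

```python
from dataclasses import dataclass, field


@dataclass
class ApiContract:
    """One documented public API. All field names here are hypothetical."""
    import_path: str                               # where tests import it from
    signature: str                                 # public call signature, no body
    defaults: dict = field(default_factory=dict)   # default parameter values
    raises: list = field(default_factory=list)     # documented exception types
    returns: str = ""                              # description of return fields


def underspecified_fields(contract: ApiContract) -> list:
    """Return the names of evaluation-relevant fields that were left empty."""
    missing = []
    if not contract.import_path:
        missing.append("import_path")
    if not contract.signature:
        missing.append("signature")
    if not contract.returns:
        missing.append("returns")
    return missing


# Hypothetical entry: the return value was never described.
contract = ApiContract(
    import_path="pkg.io.load_table",
    signature="load_table(path, delimiter=',')",
    defaults={"delimiter": ","},
    raises=["FileNotFoundError"],
)
print(underspecified_fields(contract))
```

Note that the contract stores a signature and observable behavior only, never a function body, which mirrors the requirement that the document be sufficient for testable behavior without copying the implementation.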

The DeNovoSWE Method

DeNovoSWE frames "generating a complete repository from a document" as a large-scale, verifiable long-horizon software engineering task. It does not rely on manually written documents but automatically constructs high-quality instances through a sandboxed multi-agent workflow. The entire method can be summarized in two steps: Divide and Conquer.

In the Divide stage, the system first analyzes the target repository, decomposing it into multiple repository capabilities.

Each capability corresponds to a core function or workflow within the repository, such as authentication and connection, data reading/writing, batch processing, export flows, etc. This way, the originally massive repository generation problem is split into several structurally clear document chapters.

At the same time, DeNovoSWE runs the original unit tests and collects execution traces to identify which functions, classes, and interfaces actually affect evaluation. The traces separate components into three groups: direct components, core indirect components, and non-core indirect components. Interfaces called directly by tests must be documented in detail; core indirect components that affect observable behavior also need coverage; non-core internal implementations are left for the agent to design freely.
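The three-way split can be sketched as a traversal over a traced call graph. The article does not say how DeNovoSWE decides that an indirect component is "core", so this sketch simply takes that judgment as an input (`affects_output`). The function name, argument names, and example identifiers are hypothetical.

```python
from collections import deque


def classify_components(test_calls, call_graph, affects_output):
    """Label traced components as direct, core_indirect, or non_core_indirect.

    test_calls:     names called directly by the unit tests
    call_graph:     dict mapping caller -> set of callees seen in traces
    affects_output: names assumed to influence observable behavior
                    (hypothetical input; the real criterion is not given)
    """
    labels = {name: "direct" for name in test_calls}
    queue = deque(sorted(test_calls))
    while queue:
        caller = queue.popleft()
        for callee in sorted(call_graph.get(caller, ())):
            if callee in labels:
                continue  # already classified, keep the first label
            if callee in affects_output:
                labels[callee] = "core_indirect"
            else:
                labels[callee] = "non_core_indirect"
            queue.append(callee)
    return labels


labels = classify_components(
    test_calls={"Client.connect"},
    call_graph={
        "Client.connect": {"_build_url", "_log_debug"},
        "_build_url": {"_quote"},
    },
    affects_output={"_build_url"},
)
for name, label in sorted(labels.items()):
    print(f"{name}: {label}")
```

In this toy trace, `Client.connect` would be documented in detail, `_build_url` would need coverage because it shapes observable behavior, and the remaining helpers would be left to the agent.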

In the Conquer stage, DeNovoSWE uses a Draft-Critic-Repair mechanism to generate documents for each capability one by one. The Draft agent writes an initial draft; the Critic agent checks for omissions of key APIs, behavioral contracts, or structural information; the Repair agent then fixes the document based on the feedback. This cycle iterates until each capability chapter is clear, complete, and aligned with evaluation.

Finally, the documents for different capabilities are merged into a single, comprehensive task document, serving as the sole basis for the agent to generate the repository from scratch.
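The Conquer stage and the final merge can be sketched as a loop over capabilities. The agents here are plain stub functions standing in for LLM calls. The `max_rounds` cap, the function names, and the capability dictionaries are assumptions made for illustration; the article only says that the cycle iterates until each chapter is clear, complete, and aligned with evaluation.

```python
def draft_critic_repair(capability, draft, critic, repair, max_rounds=3):
    """Iterate draft -> critic -> repair until the critic reports no issues.

    max_rounds is a hypothetical safety cap, not a value from the paper.
    """
    doc = draft(capability)
    for _ in range(max_rounds):
        issues = critic(capability, doc)
        if not issues:
            break
        doc = repair(doc, issues)
    return doc


def build_task_document(overview, capabilities, draft, critic, repair):
    """Write one chapter per capability, then merge into a single document."""
    chapters = [draft_critic_repair(c, draft, critic, repair) for c in capabilities]
    return "\n\n".join([overview] + chapters)


# Stub agents: the draft forgets every API after the first one,
# the critic lists missing APIs, and the repair appends them.
def stub_draft(cap):
    return f"Capability: {cap['name']}\n- {cap['required_apis'][0]}"


def stub_critic(cap, doc):
    return [api for api in cap["required_apis"] if api not in doc]


def stub_repair(doc, issues):
    return doc + "".join(f"\n- {api}" for api in issues)


capabilities = [
    {"name": "authentication", "required_apis": ["login", "refresh_token"]},
    {"name": "export", "required_apis": ["export_csv", "export_json"]},
]
task_document = build_task_document(
    "Overview: hypothetical example repository",
    capabilities, stub_draft, stub_critic, stub_repair,
)
print(task_document)
```

Keeping the three roles as separate callables makes it easy to swap in real model calls or to tighten the critic without touching the loop.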

Difficulty: Why is This a Long-Horizon Task?

The difficulty of DeNovoSWE tasks stems from a fundamental change: it's no longer issue-level fixing, but whole-repository generation.

In traditional SWE tasks, agents typically face an existing repository, needing only to locate a bug, modify local code, and pass tests.

In DeNovoSWE, the agent faces a cleaned environment: the original source code and tests are removed, git history is reset, and potential leakage channels like caches, site-packages residues, pip wheels, temporary compilation artifacts, etc., are also cleared. This means the agent must truly rely on the document to complete the entire repository rebuild. It needs to plan the project structure, create module files, define public interfaces, implement cross-file interactions, handle dependencies and configurations, and continuously fix errors across multiple rounds of editing and test feedback.

Any deviation in an API signature, return field, exception type, or default behavior can cause test failures. Errors can also accumulate over the long horizon: an early poorly designed module can affect multiple subsequent files and call chains.
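To make the cleaned environment concrete, the following sketch scrubs a workspace of a few leakage channels of the kind described above. The directory names and file suffixes are illustrative assumptions rather than the paper's actual list, the `TASK.md` file name is hypothetical, and removal of the original source files is assumed to happen in a separate step.

```python
import shutil
import tempfile
from pathlib import Path

# Illustrative leak channels only; the real pipeline's list is not given.
LEAKY_DIR_NAMES = {".git", "__pycache__", "tests", "build", "dist", ".pytest_cache"}
LEAKY_SUFFIXES = {".pyc", ".whl"}


def scrub_workspace(root: Path) -> list:
    """Delete leak-prone directories and files under root; return what was removed."""
    removed = []
    # sorted() materializes the listing before anything is deleted;
    # deepest paths come first so nested entries are handled before parents.
    for path in sorted(root.rglob("*"), key=lambda p: len(p.parts), reverse=True):
        if not path.exists():
            continue  # already removed together with a parent directory
        if path.is_dir() and path.name in LEAKY_DIR_NAMES:
            shutil.rmtree(path)
            removed.append(path.relative_to(root).as_posix())
        elif path.is_file() and path.suffix in LEAKY_SUFFIXES:
            path.unlink()
            removed.append(path.relative_to(root).as_posix())
    return sorted(removed)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".git").mkdir()
    (root / ".git" / "config").write_text("[core]")
    (root / "pkg" / "__pycache__").mkdir(parents=True)
    (root / "pkg" / "__pycache__" / "m.cpython-312.pyc").write_bytes(b"")
    (root / "TASK.md").write_text("task document")
    removed = scrub_workspace(root)
    survivors = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print(survivors)
```

After the scrub, only the task document and the empty package directory remain, so the agent has nothing to rely on except the document.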

To further address the difficulty variance across different repositories, DeNovoSWE also proposes difficulty-aware trajectory filtering. In simple terms, easy tasks should require a higher pass rate, while difficult tasks should not be entirely discarded just for failing to achieve a perfect score. DeNovoSWE sets different filtering thresholds for different difficulty intervals based on structural complexity and LLM difficulty assessment, thereby balancing quality and diversity.

This is particularly important for long-horizon tasks: the more complex the repository, the harder it is to pass all tests in one go. Yet trajectories from challenging repositories, even those with low scores or only partial success, still demonstrate valuable long-horizon planning and implementation behavior.
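Difficulty-aware filtering can be sketched as a threshold lookup per difficulty interval. The interval boundaries and pass-rate thresholds below are invented for illustration. The article says only that thresholds differ by difficulty interval, and that difficulty combines structural complexity with an LLM assessment. Here the difficulty score is assumed to be already computed and normalized to the range 0 to 1.

```python
# (upper bound of difficulty interval, minimum pass rate to keep a trajectory)
# All numbers are hypothetical placeholders.
THRESHOLDS = [
    (0.33, 1.0),   # easy tasks: require every test to pass
    (0.66, 0.8),   # medium tasks: allow a few failures
    (1.00, 0.5),   # hard tasks: keep partially successful trajectories
]


def min_pass_rate(difficulty: float) -> float:
    """Look up the pass-rate threshold for a difficulty score in [0, 1]."""
    for upper, threshold in THRESHOLDS:
        if difficulty <= upper:
            return threshold
    raise ValueError(f"difficulty out of range: {difficulty}")


def keep_trajectory(difficulty: float, pass_rate: float) -> bool:
    """Keep a trajectory if it meets the threshold for its difficulty interval."""
    return pass_rate >= min_pass_rate(difficulty)


trajectories = [
    {"repo": "easy-repo", "difficulty": 0.2, "pass_rate": 0.9},
    {"repo": "hard-repo", "difficulty": 0.9, "pass_rate": 0.6},
]
kept = [t["repo"] for t in trajectories if keep_trajectory(t["difficulty"], t["pass_rate"])]
print(kept)
```

With these placeholder thresholds, the easy task at a 90% pass rate is dropped while the hard task at 60% is kept, which is the asymmetry the filtering is meant to produce.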

Experimental Results

The final DeNovoSWE dataset contains 4,818 high-quality document-to-repository task instances and serves as an executable, evaluable, and trainable long-horizon software engineering environment.

Experimental results show that DeNovoSWE brings significant improvements to models' long-horizon repository generation capabilities. For Qwen3-30B-A3B-Instruct, the original model scored only 5.8% on BeyondSWE-Doc2Repo and 4.3% on NL2RepoBench. Training with conventional issue-level SWE data like Scale-SWE-Agent improved these to 29.2% and 18.3%, indicating that general SWE data does have transfer effects. However, when the model was trained using DeNovoSWE, performance further increased to 47.2% and 23.0%.

This demonstrates that data oriented towards "fixing bugs" cannot fully replace data oriented towards "generating complete repositories" for long-horizon tasks. To truly teach agents repository-level engineering, specialized training environments built for long-horizon tasks are needed.
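The size of each step is easier to see in percentage points. The short calculation below uses only the Qwen3-30B-A3B-Instruct scores quoted above; the dictionary keys are labels chosen for this sketch.

```python
# Scores (%) for Qwen3-30B-A3B-Instruct as reported in the article.
scores = {
    "BeyondSWE-Doc2Repo": {"base": 5.8, "scale_swe": 29.2, "denovoswe": 47.2},
    "NL2RepoBench": {"base": 4.3, "scale_swe": 18.3, "denovoswe": 23.0},
}


def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


gains = {
    bench: {
        "issue_level_over_base": point_gain(s["base"], s["scale_swe"]),
        "denovoswe_over_issue_level": point_gain(s["scale_swe"], s["denovoswe"]),
        "denovoswe_over_base": point_gain(s["base"], s["denovoswe"]),
    }
    for bench, s in scores.items()
}
for bench, g in gains.items():
    print(bench, g)
```

On BeyondSWE-Doc2Repo the DeNovoSWE-trained model scores 18.0 points above the one trained on issue-level data, and on NL2RepoBench the gap is 4.7 points.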

On the stronger Qwen3.5-35B-A3B backbone, DeNovoSWE similarly brought stable gains: BeyondSWE-Doc2Repo improved from 43.8% to 50.0%, and NL2RepoBench from 23.5% to 27.1%. This further indicates that the benefits of DeNovoSWE are not due to accidental adaptation to a specific model, but stem from the high-quality long-horizon data itself.

Conclusion

The next stage for code agents is not just about fixing individual issues faster, but about understanding documents, planning architecture, organizing modules, implementing interfaces, and ultimately generating a complete, runnable software repository.

DeNovoSWE turns this goal into a trainable, verifiable, and scalable dataset. It answers a key question: what kind of data can truly train agents with long-horizon software engineering capabilities?

The answer is not more fragmented code, nor simpler problems, but high-quality, structured, evaluation-aligned, anti-leakage, full-repository generation tasks.

Starting from a single document, rebuild the entire repository. This is the threshold that long-horizon code agents need to cross.

Reference: https://arxiv.org/pdf/2606.10728

This article is from the WeChat public account "AI Era," edited by LRST.

Related Questions

Q: What is the main contribution of the research from Renmin University of China's Gaoling School of Artificial Intelligence?

A: The research introduces and releases the DeNovoSWE dataset, the first long-horizon Doc2Repo training set focused on repository-level code generation from scratch in software engineering tasks.

Q: How does the DeNovoSWE dataset address the challenge of long-horizon software engineering tasks?

A: It uses a sandboxed multi-agent workflow based on Divide & Conquer and Draft-Critic-Repair mechanisms to automatically construct high-quality task documents, ensuring they are well-organized and evaluation-aligned for whole-repository generation.

Q: What were the performance improvements observed after training a model on the DeNovoSWE dataset?

A: The Qwen3-30B-A3B-Instruct model improved significantly, rising from 5.8% to 47.2% on BeyondSWE-Doc2Repo and from 4.3% to 23.0% on NL2RepoBench.

Q: What are the two core standards mentioned for a high-quality task document in document-to-repository generation?

A: First, the document must be well-organized, providing a clear overview and structured chapters. Second, it must be evaluation-aligned, describing key behaviors for verification without leaking implementation details.

Q: What is the purpose of the difficulty-aware trajectory filtering mechanism in DeNovoSWE?

A: It sets different filtering thresholds for tasks of varying difficulty to balance quality and diversity, ensuring that valuable but partially successful trajectories from complex repositories are not discarded.
