Auto Research Era: 47 Tasks Without Standard Answers Become the Must-Test Leaderboard for Agent Capabilities

marsbit | Published 2026-05-13 | Updated 2026-05-13

Introduction

The article introduces Frontier-Eng Bench, a new benchmark for AI agents developed by Einsia AI's Navers lab. Unlike traditional tests with clear answers, this benchmark presents 47 complex, real-world engineering tasks—such as optimizing underwater robot stability, battery fast-charging protocols, or quantum circuit noise control—where there is no single correct solution, only continuous optimization towards a limit. It shifts AI evaluation from static knowledge retrieval to a dynamic "engineering closed-loop": the AI must propose solutions, run simulations, interpret errors, adjust parameters, and re-run experiments to iteratively improve performance. This process tests an agent's ability to learn and evolve through long-term feedback, much like a human engineer tackling trade-offs between power, safety, and performance. Key findings from the benchmark reveal two patterns: 1) Improvements follow a power-law decay, becoming harder and smaller as optimization progresses, and 2) While exploring multiple solution paths (breadth) helps, sustained depth in a single path is crucial for breakthrough innovations. The research suggests this marks a step toward "Auto Research," where AI systems can autonomously conduct continuous, tireless optimization in scientific and engineering domains. Humans would set high-level goals, while AI agents handle the iterative experimentation and refinement. This could fundamentally change research and development workflows.

If we throw AI into an engineering site with no standard answers, can it still survive?

For a long time, AI Agents have appeared omnipotent, but in reality, most are just 'flipping through memories' within known knowledge bases.

Yet the real engineering world is harsh: the stability of underwater robots, the lithium plating boundary of power batteries, the noise control of quantum circuits... These problems have no 'perfect score', only 'optimizations that inch closer to the limit'.

Recently, the Agent Benchmark released by Navers lab under Einsia AI—Frontier-Eng Bench—officially tore off the label of AI being an 'exam-crammer'.

The research team didn't have AI grind through outdated coding problems. Instead, they gave it a complete 'engineering closed loop': propose a solution, connect to the simulator, digest errors, adjust parameters, and re-run.

Faced with 47 hardcore tasks spanning multiple disciplines, AI must behave like a senior engineer, seeking the optimal solution within the 'impossible triangle' of power consumption, safety, and performance.

This is not just a test suite; it's more like a rehearsal for Agent 'evolution'.

When AI begins to learn self-correction from feedback, the Auto Research era, where 'humans set goals and AI iterates non-stop 24/7', might be closer than we imagine.

AI Starts Tackling 'Hard Work'

Past large language models were more like super straight-A students.

You pose a question, and it 'flips through memory' across massive training data, then pieces together an answer that seems plausible.

In this mode, the large model is essentially playing 'word chain', not solving real-world problems.

But the emergence of Frontier-Eng Bench has AI doing the work of 'engineering optimization'.

The process now works like this: the AI first proposes a solution, connects to a simulator to run experiments, collects the feedback and errors, modifies parameters and code, and keeps re-running to push performance further.
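A minimal Python sketch of such a propose, simulate, adjust loop is shown here. The `simulate` function is a hypothetical stand-in for a real simulator, and the random perturbation stands in for whatever proposal strategy an agent actually uses; neither is taken from the benchmark.

```python
import random


def simulate(gain: float) -> float:
    """Hypothetical simulator: scores a controller gain, rejects invalid input."""
    if gain <= 0:
        raise ValueError("gain must be positive")
    # Toy objective: the best possible score is 0.0, reached at gain = 2.0.
    return -(gain - 2.0) ** 2


def optimize(steps: int, seed: int = 0):
    """Propose, simulate, digest errors, keep the best candidate, repeat."""
    rng = random.Random(seed)
    best_gain = 0.5
    best_score = simulate(best_gain)
    for _ in range(steps):
        # Propose: perturb the best known parameter.
        candidate = best_gain + rng.uniform(-0.5, 0.5)
        try:
            score = simulate(candidate)
        except ValueError:
            # Digest the error: drop the invalid proposal and try again.
            continue
        # Adjust: keep the candidate only if it beats the current best.
        if score > best_score:
            best_gain, best_score = candidate, score
    return best_gain, best_score
```

Because a candidate is kept only when it improves the score, the result can never be worse than the starting point, however many steps are run.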

In this closed-loop system, AI's identity undergoes a qualitative change.

Want to make the underwater robot more stable? AI must start automatically tuning the controller.

Want to increase the speed of the robotic arm a bit more? AI has to run simulations itself.

To some extent, AIs have shed their purely semantic understanding role and begun to act like professional engineers, continuously optimizing based on real-world environmental feedback.

The most interesting aspect of Frontier-Eng Bench is: it doesn't test whether AI 'answered correctly', but rather whether AI can continuously become stronger.

Because real engineering optimization is never about multiple-choice questions; there is no single standard answer.

Take fast-charging batteries as an example: the goal sounds simple—charge as fast as possible, but reality isn't so easy.

The constraints are strict: temperature mustn't spike, voltage can't overshoot, battery life can't degrade too fast, and lithium plating must be avoided. Within them, AI must hit the exact balance point of performance.

This means AI cannot get by on clever 'test-cramming' tricks; it must show the endurance to keep evolving through long-term feedback.
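One way to picture the fast-charging example is as a score that only counts when every constraint holds. The sketch below is an illustration under assumptions: the `ChargeResult` fields and the two numeric limits are placeholders chosen for the example, not values or interfaces from the benchmark, and the battery-life constraint is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ChargeResult:
    """Hypothetical summary of one simulated charging run."""
    minutes_to_full: float
    peak_temp_c: float
    peak_voltage_v: float
    plating_detected: bool


# Placeholder limits for illustration only, not values from the benchmark.
MAX_TEMP_C = 45.0
MAX_VOLTAGE_V = 4.2


def score(result: ChargeResult) -> float:
    """Faster charging scores higher, but only if no constraint is violated."""
    feasible = (
        result.peak_temp_c <= MAX_TEMP_C
        and result.peak_voltage_v <= MAX_VOLTAGE_V
        and not result.plating_detected
    )
    if not feasible:
        return float("-inf")  # an unsafe protocol never beats a safe one
    return -result.minutes_to_full
```

With this shape of objective, a protocol that charges faster but overheats is ranked below every safe protocol, which is why there is no shortcut around the constraints.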

Can AI perform long-term optimization in real environments?

Looking at the results, GPT5.4 showed the most stable overall performance, but AIs still have a long way to go before 'solving' the Benchmark.

Auto Research Enters the 'Iterative Optimization' Era

The research team raised a very interesting point in their paper:

Truly advanced intelligence essentially relies on long-term feedback loops.

AlphaGo's victory over Lee Sedol, for example, came from the vast number of simulations and immediate feedback behind each decision, not from rote memorization of established game records.

True scientific research is the same: top labs don't rely on a single burst of inspiration, but continuously propose hypotheses, run experiments, examine results, modify plans, and try again.

Engineering optimization follows the same principle: anyone can create the first version; what's truly difficult is that final 1% performance leap.

The significance of Frontier-Eng Bench lies here: For the first time, it systematically begins testing AI's 'iterative optimization capability', and has summarized two nearly brutal laws of AI evolution.

The first law is: The further you go, the harder the improvement.

This paper found that the frequency and magnitude of Agent improvements follow a power-law decay:

Improvement frequency ∝ 1 / iteration count
Improvement magnitude ∝ 1 / improvement count

Simply put: the fastest gains come in the first few rounds, and it gets progressively harder and smaller later on.

This closely resembles the real R&D process: the first versions can quickly pick off the low-hanging fruit, but the closer the work gets to the bottleneck, the more effort it takes to squeeze out even a little more performance.
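The two proportionalities above can be turned into a small worked calculation. The sketch below assumes the constants of proportionality are exactly 1 and that iterations are independent, which the article does not state; it is meant only to show the shape of the decay.

```python
import math


def harmonic(n: int) -> float:
    """Sum of 1/t for t = 1..n."""
    return sum(1.0 / t for t in range(1, n + 1))


def expected_improvements(iterations: int) -> float:
    # Assumes the chance of an improvement at iteration t is exactly 1/t.
    return harmonic(iterations)


def cumulative_gain(improvements: int) -> float:
    # Assumes the k-th improvement has magnitude exactly 1/k (arbitrary units).
    return harmonic(improvements)


if __name__ == "__main__":
    extra = expected_improvements(200) - expected_improvements(100)
    print(f"extra improvements from iterations 101-200: {extra:.3f}")
    print(f"ln 2 for comparison: {math.log(2):.3f}")
```

Under these assumptions both quantities grow only logarithmically: doubling the iteration count from 100 to 200 adds roughly ln 2 ≈ 0.69 expected improvements, and so does doubling it from 1,000 to 2,000.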

Would it be more cost-effective to explore multiple paths in parallel for trial and error? The answer lies in the second law.

The second law: Breadth is useful, but depth is even more indispensable.

Running multiple parallel paths can avoid getting stuck, but with a fixed budget, each additional chain reduces how deep every chain can explore.

Many engineering breakthroughs require continuous accumulation and constant correction before structural leaps emerge; they can't be achieved simply by 'trying a few more times'.
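The breadth-versus-depth trade-off is simple arithmetic once the budget is fixed. The sketch below assumes the budget is counted in simulator runs and split evenly across chains; both are simplifications for illustration, not the paper's protocol.

```python
def depth_per_chain(total_budget: int, chains: int) -> int:
    """Iterations available to each chain when the budget is split evenly."""
    if chains < 1:
        raise ValueError("need at least one chain")
    return total_budget // chains


if __name__ == "__main__":
    # Hypothetical budget of 120 simulator runs.
    for chains in (1, 2, 4, 8):
        print(chains, "chain(s):", depth_per_chain(120, chains), "iterations each")
```

For instance, `depth_per_chain(120, 4)` returns 30: four parallel chains each get a quarter of the depth a single chain would have had.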

This actually points towards the development direction of next-generation Agents: not models that 'output an answer once', but systems that can continuously iterate and self-evolve within long-term feedback loops.

AI Engineers Might Really Be Coming

The true far-reaching significance of this research lies in its preliminary outline of an AI system beginning to approach the real engineering cycle.

Imagine when AI connects to industrial software, simulation environments, CAD systems, chip design tools, scientific computing platforms...

A dramatic shift in how productive work gets done may be on the verge of emerging.

In future labs, a division of labor like this might appear:

Human researchers are responsible for proposing directions and goals.

For example, 'reduce this component's energy consumption by 30%', 'compress this model's forward pass GPU usage even lower', 'increase the stability of robot control a bit more', 'push the fidelity of this quantum circuit closer to the limit', etc.

And AI is responsible for 'grinding the path': it focuses on these goals and optimizes continuously.

For example, automatically running simulations and experiments, automatically reading feedback from verifiers and simulators, then continuing to modify and optimize, iterating non-stop 24/7.

This evolutionary logic frees AI from the identity of an 'assistive tool', allowing it to begin solving complex system problems like a real engineering team—and tirelessly at that.

And the question raised by Frontier-Eng Bench is actually very direct:

When AI begins to learn 'long-term optimization', how far is it from true engineering intelligence?

Paper Title: Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Project Homepage: https://lab.einsia.ai/frontier-eng/

Arxiv: https://arxiv.org/abs/2604.12290

GitHub repo: https://github.com/EinsiaLab/Frontier-Engineering

This article is from the WeChat public account "Quantum Bit", author: Yun Zhong

Related Questions

Q: What is the main purpose of the Frontier-Eng Benchmark released by Einsia AI's Navers lab?

A: The main purpose of the Frontier-Eng Benchmark is to move beyond testing AI's ability to recall known information. It systematically tests AI agents' capability for 'iterative optimization' on 47 real-world, open-ended engineering tasks without standard answers, evaluating if they can continuously improve performance through a feedback loop involving simulation, error analysis, and parameter adjustment.

Q: How does the AI's role change in the Frontier-Eng Benchmark testing process compared to traditional language models?

A: In the Frontier-Eng Benchmark, the AI transitions from acting as a 'super student' that retrieves and assembles answers from training data to performing 'engineering optimization.' Its role becomes akin to a professional engineer: it proposes solutions, runs simulations, analyzes feedback and errors, modifies parameters/code, and reruns experiments in a continuous loop to seek optimal performance under complex constraints.

Q: What are the two key 'AI evolution laws' discovered through the Frontier-Eng Benchmark regarding iterative optimization?

A: The two key laws are: 1) Improvements become progressively harder and smaller (showing a power-law decay: Improvement frequency ∝ 1/iteration count, Improvement magnitude ∝ 1/improvement count). 2) While exploring multiple parallel paths (breadth) is useful, sustained depth in a single optimization path is more critical for achieving structural breakthroughs, as fixed budgets force a trade-off between breadth and depth.

Q: What future work paradigm does the article suggest might emerge from the development of self-evolving AI agents?

A: The article suggests a future 'Auto Research' paradigm where human researchers define the goals and direction (e.g., 'reduce component energy consumption by 30%'), and AI agents take on the role of 'grinding the path.' They would work autonomously and tirelessly, running simulations, interpreting feedback from verifiers and simulators, and iteratively optimizing around the clock to approach performance limits.

Q: According to the article, what fundamental shift in AI capability does the Frontier-Eng Benchmark represent?

A: The Frontier-Eng Benchmark represents a fundamental shift from evaluating AI's ability to find predetermined 'correct answers' to testing its capacity for 'self-evolution' through long-term feedback loops. It moves the focus to whether AI can demonstrate sustained learning and improvement in complex, real-world scenarios with no single correct answer, pushing AI closer to genuine engineering intelligence.
