Auto Research Era: 47 Tasks Without Standard Answers Become the Must-Test Leaderboard for Agent Capabilities

marsbit | Published on 2026-05-13 | Last updated on 2026-05-13

Abstract

The article introduces Frontier-Eng Bench, a new benchmark for AI agents developed by Einsia AI's Navers lab. Unlike traditional tests with clear answers, this benchmark presents 47 complex, real-world engineering tasks—such as optimizing underwater robot stability, battery fast-charging protocols, or quantum circuit noise control—where there is no single correct solution, only continuous optimization towards a limit. It shifts AI evaluation from static knowledge retrieval to a dynamic "engineering closed-loop": the AI must propose solutions, run simulations, interpret errors, adjust parameters, and re-run experiments to iteratively improve performance. This process tests an agent's ability to learn and evolve through long-term feedback, much like a human engineer tackling trade-offs between power, safety, and performance. Key findings from the benchmark reveal two patterns: 1) Improvements follow a power-law decay, becoming harder and smaller as optimization progresses, and 2) While exploring multiple solution paths (breadth) helps, sustained depth in a single path is crucial for breakthrough innovations. The research suggests this marks a step toward "Auto Research," where AI systems can autonomously conduct continuous, tireless optimization in scientific and engineering domains. Humans would set high-level goals, while AI agents handle the iterative experimentation and refinement. This could fundamentally change research and development workflows.

If we throw AI into an engineering site with no standard answers, can it still survive?

For a long time, AI Agents have appeared omnipotent, but in reality, most are just 'flipping through memories' within known knowledge bases.

Yet the real engineering world is harsh: the stability of underwater robots, the lithium plating boundary of power batteries, the noise control of quantum circuits... These problems have no 'perfect score', only 'optimizations that inch closer to the limit'.

Recently, the Agent benchmark released by Navers lab under Einsia AI, Frontier-Eng Bench, officially tore off AI's 'exam-crammer' label.

The research team didn't have AI grind through outdated coding problems. Instead, they gave it a complete 'engineering closed loop': propose a solution, connect to the simulator, digest errors, adjust parameters, and re-run.

Faced with 47 hardcore tasks spanning multiple disciplines, AI must behave like a senior engineer, seeking the optimal solution within the 'impossible triangle' of power consumption, safety, and performance.

This is not just a test suite; it's more like a rehearsal for Agent 'evolution'.

When AI begins to learn self-correction from feedback, the Auto Research era, where 'humans set goals and AI iterates non-stop 24/7', might be closer than we imagine.

AI Starts Tackling 'Hard Work'

Past large language models were more like super straight-A students.

You pose a question, it 'flips through memory' from massive training data, then pieces together an answer that seems plausible.

In this mode, the large model is essentially playing 'word chain', not solving real-world problems.

But the emergence of Frontier-Eng Bench has AI doing the work of 'engineering optimization'.

The process has shifted: AI first proposes a solution, then connects to a simulator to run experiments, obtains feedback and error signals, modifies parameters and code, and re-runs, pushing performance a little further with each cycle.
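The closed loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in: `simulate` plays the role of the benchmark's simulator and `propose_update` is a naive stand-in for the agent's adjustment policy; neither comes from the paper.

```python
# Minimal sketch of the propose -> simulate -> read feedback -> adjust loop.
# `simulate` and `propose_update` are toy stand-ins: here the "task" is
# tuning a single controller gain whose score peaks at gain = 2.0.

def simulate(params):
    return 1.0 - (params["gain"] - 2.0) ** 2   # toy simulator, best score 1.0

def propose_update(params, score, prev_score):
    # Naive agent policy: keep moving while the score improves,
    # otherwise reverse direction and halve the step.
    step = params["step"]
    if prev_score is not None and score < prev_score:
        step = -step / 2
    return {"gain": params["gain"] + step, "step": step}

def closed_loop(iterations=50):
    params, prev_score, best = {"gain": 0.0, "step": 0.5}, None, float("-inf")
    for _ in range(iterations):
        score = simulate(params)                            # run the experiment
        best = max(best, score)                             # track best-so-far
        params = propose_update(params, score, prev_score)  # digest feedback
        prev_score = score
    return best

print(closed_loop())   # → 1.0, the toy task's optimum
```

In the real benchmark the 'simulator' is a full physics or circuit environment and the 'agent' is a model reading error traces, but the loop's shape (propose, run, read feedback, adjust, re-run) is the same.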

In this closed-loop system, AI's identity undergoes a qualitative change.

Want to make the underwater robot more stable? AI must start automatically tuning the controller.

Want to increase the speed of the robotic arm a bit more? AI has to run simulations itself.

To some extent, AIs have shed their purely semantic understanding role and begun to act like professional engineers, continuously optimizing based on real-world environmental feedback.

The most interesting aspect of Frontier-Eng Bench is: it doesn't test whether AI 'answered correctly', but rather whether AI can continuously become stronger.

Because real engineering optimization is never about multiple-choice questions; there is no single standard answer.

Take fast-charging batteries as an example: the goal sounds simple—charge as fast as possible, but reality isn't so easy.

Under strict constraints (the temperature mustn't spike, the voltage can't overshoot its limit, battery life can't degrade too fast, and lithium plating must be avoided), AI must precisely hit the balance point of performance.

This means AI cannot get by on clever 'test-cramming' tricks; it must demonstrate the endurance to keep evolving through long-term feedback.
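In optimization terms, the fast-charging task is a constrained search: push the charging current as high as possible while every safety limit still holds. A minimal sketch, with invented stand-in models for heating and lithium plating (none of these formulas or limits come from the benchmark):

```python
# Toy version of the fast-charging trade-off: find the highest charging
# current that still respects a temperature ceiling and a lithium-plating
# proxy limit. All formulas and limits are invented for illustration.

def peak_temp_rise(current_a):
    return 0.5 * current_a ** 2       # pretend I^2R heating, in deg C

def plating_risk(current_a):
    return current_a / 6.0            # pretend plating risk grows with current

def fastest_safe_current(max_temp_rise=15.0, max_risk=0.95):
    best = 0.0
    for i in range(1001):             # sweep currents 0.00 .. 10.00 A
        current = i / 100
        if peak_temp_rise(current) <= max_temp_rise and plating_risk(current) <= max_risk:
            best = current            # remember the highest feasible current
    return best

print(fastest_safe_current())         # → 5.47 (temperature is the binding limit)
```

The point this toy makes is the article's point: the answer is never 'charge at maximum current', but the exact operating point where the first safety constraint becomes binding.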

Can AI perform long-term optimization in real environments?

Looking at the results, GPT5.4 showed the most stable overall performance, but AI still has a long way to go before 'solving' the benchmark.

Auto Research Enters the 'Iterative Optimization' Era

The research team raised a very interesting point in their paper:

Truly advanced intelligence essentially relies on long-term feedback loops.

AlphaGo's victory over Lee Sedol rested on the vast number of simulations and the immediate feedback behind each decision, not on rote memorization of established game records.

True scientific research is the same: top labs don't rely on a single burst of inspiration, but continuously propose hypotheses, run experiments, examine results, modify plans, and try again.

Engineering optimization follows the same principle: anyone can create the first version; what's truly difficult is that final 1% performance leap.

The significance of Frontier-Eng Bench lies here: For the first time, it systematically begins testing AI's 'iterative optimization capability', and has summarized two nearly brutal laws of AI evolution.

The first law is: The further you go, the harder the improvement.

This paper found that the frequency and magnitude of Agent improvements follow a power-law decay:

  • Improvement frequency ∝ 1 / iteration count
  • Improvement magnitude ∝ 1 / improvement count

Simply put: the fastest gains come in the first few rounds, and it gets progressively harder and smaller later on.

This closely resembles the real R&D process: the first version of AI can quickly eliminate many 'low-hanging fruits', but the closer it gets to the bottleneck, the more effort is required to squeeze out even a bit more performance.
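These two proportionalities can be turned into a toy model of expected progress. The constants below are made up for illustration; only the 1/t and 1/k decay shapes come from the paper's findings:

```python
# Toy model of cumulative expected progress under the two decay laws:
# the chance of an improvement at iteration t falls like 1/t, and the
# size of the k-th improvement falls like 1/k. Constants are illustrative.

def expected_progress(iterations):
    total, k = 0.0, 0.0
    for t in range(1, iterations + 1):
        p_improve = 1.0 / t        # improvement frequency ~ 1 / iteration count
        k += p_improve             # expected number of improvements so far
        gain = 1.0 / k             # improvement magnitude ~ 1 / improvement count
        total += p_improve * gain  # expected gain contributed by this round
    return total

for n in (10, 100, 1000):
    print(n, round(expected_progress(n), 3))
```

Running it for 10, 100, and 1,000 iterations shows cumulative progress still rising, but by ever-smaller amounts for each extra order of magnitude of compute.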

Would it be more cost-effective to explore multiple paths in parallel for trial and error? The answer lies in the second law.

The second law: Breadth is useful, but depth is even more indispensable.

Running multiple parallel paths helps avoid getting stuck, but with a fixed budget, each additional chain opened reduces how deep every path can explore.

Many engineering breakthroughs require continuous accumulation and constant correction before structural leaps emerge; they can't be achieved simply by 'trying a few more times'.
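The trade-off can be sketched with a toy model: a fixed iteration budget is split across parallel chains, each chain's score grows logarithmically with its depth plus some chain-specific luck, and the run's result is the best chain. Both the log-shaped returns and the noise level are assumptions made here for illustration, not numbers from the paper.

```python
import math
import random

# Toy breadth-vs-depth model under a fixed iteration budget. The budget is
# split evenly across `chains` parallel optimization paths; each chain's
# final score grows logarithmically with its depth plus chain-specific luck,
# and the run's result is the best chain.

def best_score(budget, chains, trials=4000, seed=0):
    rng = random.Random(seed)
    depth = budget // chains          # more chains -> shallower exploration each
    total = 0.0
    for _ in range(trials):
        total += max(math.log1p(depth) + rng.gauss(0, 1.5) for _ in range(chains))
    return total / trials

budget = 64
for chains in (1, 2, 4, 16, 64):
    print(chains, round(best_score(budget, chains), 2))
```

With this noise level, a little breadth beats a single chain, but splitting the budget 64 ways leaves every chain too shallow to compete, which mirrors the article's point that breadth helps but cannot substitute for depth.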

This actually points towards the development direction of next-generation Agents: not models that 'output an answer once', but systems that can continuously iterate and self-evolve within long-term feedback loops.

AI Engineers Might Really Be Coming

The true far-reaching significance of this research lies in its preliminary outline of an AI system beginning to approach the real engineering cycle.

Imagine when AI connects to industrial software, simulation environments, CAD systems, chip design tools, scientific computing platforms...

A dramatic transformation in how productive work gets done is on the verge of emerging.

In future labs, a division of labor like this might appear:

Human researchers are responsible for proposing directions and goals.

For example, 'reduce this component's energy consumption by 30%', 'compress this model's forward pass GPU usage even lower', 'increase the stability of robot control a bit more', 'push the fidelity of this quantum circuit closer to the limit', etc.

AI, in turn, is responsible for 'grinding the path': staying locked on these goals and optimizing continuously.

For example, automatically running simulations and experiments, automatically reading feedback from verifiers and simulators, then continuing to modify and optimize, iterating non-stop 24/7.

This evolutionary logic frees AI from the identity of an 'assistive tool', allowing it to begin solving complex system problems like a real engineering team—and tirelessly at that.

And the issues revealed by the Frontier-Eng Benchmark are actually very direct:

When AI begins to learn 'long-term optimization', how far is it from true engineering intelligence?

Paper Title: Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Project Homepage: https://lab.einsia.ai/frontier-eng/

arXiv: https://arxiv.org/abs/2604.12290

GitHub repo: https://github.com/EinsiaLab/Frontier-Engineering

This article is from the WeChat public account "Quantum Bit", author: Yun Zhong

Related Questions

Q: What is the main purpose of the Frontier-Eng Benchmark released by Einsia AI's Navers lab?

A: The main purpose of the Frontier-Eng Benchmark is to move beyond testing AI's ability to recall known information. It systematically tests AI agents' capability for 'iterative optimization' on 47 real-world, open-ended engineering tasks without standard answers, evaluating whether they can continuously improve performance through a feedback loop involving simulation, error analysis, and parameter adjustment.

Q: How does the AI's role change in the Frontier-Eng Benchmark testing process compared to traditional language models?

A: In the Frontier-Eng Benchmark, the AI transitions from acting as a 'super student' that retrieves and assembles answers from training data to performing 'engineering optimization.' Its role becomes akin to a professional engineer: it proposes solutions, runs simulations, analyzes feedback and errors, modifies parameters and code, and reruns experiments in a continuous loop to seek optimal performance under complex constraints.

Q: What are the two key 'AI evolution laws' discovered through the Frontier-Eng Benchmark regarding iterative optimization?

A: The two key laws are: 1) Improvements become progressively harder and smaller, following a power-law decay (improvement frequency ∝ 1/iteration count, improvement magnitude ∝ 1/improvement count). 2) While exploring multiple parallel paths (breadth) is useful, sustained depth in a single optimization path is more critical for achieving structural breakthroughs, since a fixed budget forces a trade-off between breadth and depth.

Q: What future work paradigm does the article suggest might emerge from the development of self-evolving AI agents?

A: The article suggests a future 'Auto Research' paradigm in which human researchers define the goals and direction (e.g., 'reduce component energy consumption by 30%'), and AI agents take on the role of 'grinding the path.' They would work autonomously and tirelessly, running simulations, interpreting feedback from verifiers and simulators, and iteratively optimizing 24/7 to approach performance limits.

Q: According to the article, what fundamental shift in AI capability does the Frontier-Eng Benchmark represent?

A: The Frontier-Eng Benchmark represents a fundamental shift from evaluating AI's ability to find predetermined 'correct answers' to testing its capacity for 'self-evolution' through long-term feedback loops. It moves the focus to whether AI can demonstrate sustained learning and improvement in complex, real-world scenarios with no single correct answer, pushing AI closer to genuine engineering intelligence.
