Auto Research Era: 47 Tasks Without Standard Answers Become the Must-Test Leaderboard for Agent Capabilities

marsbitPublicado em 2026-05-13Última atualização em 2026-05-13

Resumo

The article introduces Frontier-Eng Bench, a new benchmark for AI agents developed by Einsia AI's Navers lab. Unlike traditional tests with clear answers, this benchmark presents 47 complex, real-world engineering tasks—such as optimizing underwater robot stability, battery fast-charging protocols, or quantum circuit noise control—where there is no single correct solution, only continuous optimization towards a limit. It shifts AI evaluation from static knowledge retrieval to a dynamic "engineering closed-loop": the AI must propose solutions, run simulations, interpret errors, adjust parameters, and re-run experiments to iteratively improve performance. This process tests an agent's ability to learn and evolve through long-term feedback, much like a human engineer tackling trade-offs between power, safety, and performance. Key findings from the benchmark reveal two patterns: 1) Improvements follow a power-law decay, becoming harder and smaller as optimization progresses, and 2) While exploring multiple solution paths (breadth) helps, sustained depth in a single path is crucial for breakthrough innovations. The research suggests this marks a step toward "Auto Research," where AI systems can autonomously conduct continuous, tireless optimization in scientific and engineering domains. Humans would set high-level goals, while AI agents handle the iterative experimentation and refinement. This could fundamentally change research and development workflows.

If we throw AI into an engineering site with no standard answers, can it still survive?

For a long time, AI Agents have appeared omnipotent, but in reality, most are just 'flipping through memories' within known knowledge bases.

Yet the real engineering world is harsh: the stability of underwater robots, the lithium plating boundary of power batteries, the noise control of quantum circuits... These problems have no 'perfect score', only 'optimizations that inch closer to the limit'.

Recently, the Agent Benchmark released by Navers lab under Einsia AIFrontier-Eng Bench—officially tore off the label of AI being an 'exam-crammer'.

The research team didn't have AI grind through outdated coding problems. Instead, they gave it a complete 'engineering closed loop': propose a solution, connect to the simulator, digest errors, adjust parameters, and re-run.

Faced with 47 hardcore tasks spanning multiple disciplines, AI must behave like a senior engineer, seeking the optimal solution within the 'impossible triangle' of power consumption, safety, and performance.

This is not just a test suite; it's more like a rehearsal for Agent 'evolution'.

When AI begins to learn self-correction from feedback, the Auto Research era, where 'humans set goals and AI iterates non-stop 24/7', might be closer than we imagine.

AI Starts Tackling 'Hard Work'

Past large language models were more like super straight-A students.

You pose a question, it 'flips through memory' from massive training data, then pieces together an answer that seems plausible.

In this mode, the large model is essentially playing 'word chain', not solving real-world problems.

But the emergence of Frontier-Eng Bench has AI doing the work of 'engineering optimization'.

The process has shifted to letting AI first propose a solution, then connect to a simulator to run experiments, subsequently obtain feedback and errors, modify parameters and code, and continue re-running until performance improves further.

In this closed-loop system, AI's identity undergoes a qualitative change.

Want to make the underwater robot more stable? AI must start automatically tuning the controller.

Want to increase the speed of the robotic arm a bit more? AI has to run simulations itself.

To some extent, AIs have shed their purely semantic understanding role and begun to act like professional engineers, continuously optimizing based on real-world environmental feedback.

The most interesting aspect of Frontier-Eng Bench is: it doesn't test whether AI 'answered correctly', but rather whether AI can continuously become stronger.

Because real engineering optimization is never about multiple-choice questions; there is no single standard answer.

Take fast-charging batteries as an example: the goal sounds simple—charge as fast as possible, but reality isn't so easy.

Under strict constraints like temperature mustn't spike, voltage can't overspeed, battery life can't drop too fast, and lithium plating must be avoided, AI must precisely hit the balance point of performance.

This means AI cannot pass through by any clever 'test-cramming' tricks; it must demonstrate endurance for continuous evolution through long-term feedback.

Can AI perform long-term optimization in real environments?

Looking at the results, GPT5.4 showed the most stable overall performance, but AIs still have a long way to go before 'solving' the Benchmark.

Auto Research Enters the 'Iterative Optimization' Era

The research team raised a very interesting point in their paper:

Truly advanced intelligence essentially relies on long-term feedback loops.

Just as AlphaGo could defeat Lee Sedol, it lay in the vast number of simulations and immediate feedback behind each decision, not the rote memorization of established game records.

True scientific research is the same: top labs don't rely on a single burst of inspiration, but continuously propose hypotheses, run experiments, examine results, modify plans, and try again.

Engineering optimization follows the same principle: anyone can create the first version; what's truly difficult is that final 1% performance leap.

The significance of Frontier-Eng Bench lies here: For the first time, it systematically begins testing AI's 'iterative optimization capability', and has summarized two nearly brutal laws of AI evolution.

The first law is: The further you go, the harder the improvement.

This paper found that the frequency and magnitude of Agent improvements follow a power-law decay:

  • Improvement frequency ∝ 1 / iteration count
  • Improvement magnitude ∝ 1 / improvement count

Simply put: the fastest gains come in the first few rounds, and it gets progressively harder and smaller later on.

This closely resembles the real R&D process: the first version of AI can quickly eliminate many 'low-hanging fruits', but the closer it gets to the bottleneck, the more effort is required to squeeze out even a bit more performance.

Would it be more cost-effective to explore multiple paths in parallel for trial and error? The answer lies in the second law.

The second law: Breadth is useful, but depth is even more indispensable.

Running multiple parallel paths can avoid getting stuck, but with a fixed budget, each additional chain opened shallows the depth of exploration.

Many engineering breakthroughs require continuous accumulation and constant correction before structural leaps emerge; they can't be achieved simply by 'trying a few more times'.

This actually points towards the development direction of next-generation Agents: not models that 'output an answer once', but systems that can continuously iterate and self-evolve within long-term feedback loops.

AI Engineers Might Really Be Coming

The true far-reaching significance of this research lies in its preliminary outline of an AI system beginning to approach the real engineering cycle.

Imagine when AI connects to industrial software, simulation environments, CAD systems, chip design tools, scientific computing platforms...

A dramatic transformation in the modality of productivity is on the verge of emerging.

In future labs, a division of labor like this might appear:

Human researchers are responsible for proposing directions and goals.

For example, 'reduce this component's energy consumption by 30%', 'compress this model's forward pass GPU usage even lower', 'increase the stability of robot control a bit more', 'push the fidelity of this quantum circuit closer to the limit', etc.

And AI is responsible for 'grinding the path'. They focus on these goals, continuously optimizing.

For example, automatically running simulations and experiments, automatically reading feedback from verifiers and simulators, then continuing to modify and optimize, iterating non-stop 24/7.

This evolutionary logic frees AI from the identity of an 'assistive tool', allowing it to begin solving complex system problems like a real engineering team—and tirelessly at that.

And the issues revealed by the Frontier-Eng Benchmark are actually very direct:

When AI begins to learn 'long-term optimization', how far is it from true engineering intelligence?

Paper Title: Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Project Homepage: https://lab.einsia.ai/frontier-eng/

Arxiv: https://arxiv.org/abs/2604.12290

GitHub repo: https://github.com/EinsiaLab/Frontier-Engineering

This article is from the WeChat public account "Quantum Bit", author: Yun Zhong

Perguntas relacionadas

QWhat is the main purpose of the Frontier-Eng Benchmark released by Einsteina AI's Navers lab?

AThe main purpose of the Frontier-Eng Benchmark is to move beyond testing AI's ability to recall known information. It systematically tests AI agents' capability for 'iterative optimization' on 47 real-world, open-ended engineering tasks without standard answers, evaluating if they can continuously improve performance through a feedback loop involving simulation, error analysis, and parameter adjustment.

QHow does the AI's role change in the Frontier-Eng Benchmark testing process compared to traditional language models?

AIn the Frontier-Eng Benchmark, the AI transitions from acting as a 'super student' that retrieves and assembles answers from training data to performing 'engineering optimization.' Its role becomes akin to a professional engineer: it proposes solutions, runs simulations, analyzes feedback and errors, modifies parameters/code, and reruns experiments in a continuous loop to seek optimal performance under complex constraints.

QWhat are the two key 'AI evolution laws' discovered through the Frontier-Eng Benchmark regarding iterative optimization?

AThe two key laws are: 1) Improvements become progressively harder and smaller (showing a power-law decay: Improvement frequency ∝ 1/iteration count, Improvement magnitude ∝ 1/improvement count). 2) While exploring multiple parallel paths (breadth) is useful, sustained depth in a single optimization path is more critical for achieving structural breakthroughs, as fixed budgets force a trade-off between breadth and depth.

QWhat future work paradigm does the article suggest might emerge from the development of self-evolving AI agents?

AThe article suggests a future 'Auto Research' paradigm where human researchers define the goals and direction (e.g., 'reduce component energy consumption by 30%'), and AI agents take on the role of 'grinding the path.' They would work autonomously and tirelessly—running simulations, interpreting feedback from verifiers and simulators, and iteratively optimizing—24/7 to approach performance limits.

QAccording to the article, what fundamental shift in AI capability does the Frontier-Eng Benchmark represent?

AThe Frontier-Eng Benchmark represents a fundamental shift from evaluating AI's ability to find predetermined 'correct answers' to testing its capacity for 'self-evolution' through long-term feedback loops. It moves the focus to whether AI can demonstrate sustained learning and improvement in complex, real-world scenarios with no single correct answer, pushing AI closer to genuine engineering intelligence.

Leituras Relacionadas

A Nation Blocks Chips, a Giant Buys a Nuclear Power Plant: Why It's Time to Seriously Consider DeAI

**Title: Great Powers Blockade Chips, Giants Buy Nuclear Plants: Why It's Time to Seriously Consider DeAI** In May 2026, the US closed loopholes for Chinese firms to acquire advanced NVIDIA chips via overseas subsidiaries. That same month, Kenya halted a $1B geothermal data center project involving Microsoft, fearing its immense energy consumption. Meanwhile, Huawei announced mass production of its Ascend AI chip. These disparate events underscore a new reality: the competition for computing power ("compute") has escalated beyond the tech industry, becoming a geopolitical and infrastructural battleground. A new era of oligopoly is forming, with control over the AI stack—from GPU chips (NVIDIA) and cloud platforms (AWS, Azure, Google Cloud) to foundational models (OpenAI, Anthropic)—concentrating in a few Western "AI Octopus" corporations. This centralization creates systemic risks: pricing power and platform lock-in for users, infrastructure fragility, and a widening "compute divide" that threatens to marginalize nations without independent AI capacity. An "AI Iron Curtain" is deepening through export controls. In response, some nations like Saudi Arabia and the UAE are investing heavily to buy compute power, aiming to transition from oil to AI economies. The EU seeks to triple its compute capacity by 2030 to reduce dependency. However, the spending gap is vast, with four US tech giants alone planning ~$750B in AI capex for 2026. The race is increasingly constrained by energy, with AI tasks consuming up to 1000x more power than web searches, pushing firms to even acquire nuclear plants. This landscape is fueling interest in Decentralized AI (DeAI). It proposes a third way: using open protocols to coordinate a global network of idle GPUs, independent developers, and data centers, creating an AI infrastructure without a single controlling entity. Leveraging blockchain and cryptographic verification, DeAI aims to break market concentration, disperse energy demands, reduce geopolitical dependencies, and enhance transparency. While still nascent in performance and stability, DeAI's core promise is not immediate superiority but providing a crucial alternative architecture to resist monopoly, censorship, and centralized power. As specialized AI hardware costs fall and open-source models flourish, the window to build this foundation is open. The very existence of such competition serves as a vital check against the inevitable abuse of concentrated power.

marsbitHá 35m

A Nation Blocks Chips, a Giant Buys a Nuclear Power Plant: Why It's Time to Seriously Consider DeAI

marsbitHá 35m

Outpoll Review: A Prediction Market Platform Built for Active Traders

Outpoll Review: A Prediction Market Platform Built for Active Traders In recent years, prediction markets have grown from a niche sector to a mainstream arena, attracting billions in trading volume and institutional capital. However, the user experience and tools for traders have not kept pace. Outpoll, a new global prediction market platform, aims to fill this gap by providing enhanced trading infrastructure for active and professional traders. Built on standard prediction market principles, Outpoll allows users to trade on the outcome of specific events. It uses fully collateralized contracts with USDC settlement, charges a competitive 0.1% fee per trade, and provides clear settlement rules upfront to minimize disputes. A key focus for Outpoll is its professional-grade trading tools. The platform supports limit and market orders, as well as take-profit and stop-loss orders for open positions—features uncommon in prediction markets. For automated trading, Outpoll offers comprehensive REST and WebSocket APIs, enabling portfolio management, price arbitrage, and integration with existing tools. The platform also features a creator-led market model, where approved experts and community leaders can create and manage markets for niche topics under platform supervision. Its integrated interface combines news feeds directly with trading functions, allowing users to monitor events and manage positions seamlessly. Outpoll launched with a native Android app (available on Google Play) and plans an iOS version later this year. In summary, Outpoll distinguishes itself with trader-focused tools, practical APIs, transparent and collateralized markets, integrated news, and an expanding creator program. For active traders, its advanced order types and API access alone make it a platform worth watching. Outpoll is now globally accessible via outpoll.com and Google Play.

marsbitHá 43m

Outpoll Review: A Prediction Market Platform Built for Active Traders

marsbitHá 43m

Bitwise: Crypto Becomes a Contrarian Investment, Three Logics to Understand the Current Market

**Summary** Matt Hougan, Bitwise's CIO, analyzes the current crypto market through three key lenses, arguing it has shifted from a momentum-driven to a contrarian investment. **1) Crypto Becomes a Contrarian Play:** The market is weak, with major assets like Bitcoin and Ethereum down significantly. Capital has moved to hot sectors like AI, leaving crypto as an "unloved" asset class. This transforms crypto investing from trend-following to a test of patience and fundamental analysis. Investors now favor projects with solid fundamentals (e.g., Hyperliquid) over speculative ones. **2) Regulatory Overhang:** The uncertain fate of the U.S. CLARITY Act, a major crypto regulatory framework, is a key headwind. With its passage in 2024 seen as far from guaranteed (estimates range from 30-55%), institutional capital remains on the sidelines, choosing less risky alternatives like AI stocks. The market needs clarity—whether the bill passes or fails—more than any specific outcome to move decisively. **3) Capital Rotates to New Fundamentals:** This cycle differs from past bear markets where money fled to Bitcoin. Now, capital seeks smaller assets with strong use cases. While major cryptos fell in May 2024, tokens like Hyperliquid (+72%), Zcash (+50%), and XLM (+44%) rallied on their specific fundamentals. This rotation confirms the new contrarian, fundamentals-driven logic and signals the bear market may be in its later stages. **Conclusion:** Short-term pressure persists due to regulatory uncertainty and competition from AI narratives. Investing in crypto now requires a contrarian mindset—acting against the crowd and focusing on fundamental value. Patience and targeting high-quality projects based on their merits are essential for capturing long-term gains.

marsbitHá 1h

Bitwise: Crypto Becomes a Contrarian Investment, Three Logics to Understand the Current Market

marsbitHá 1h

ChatGPT Might Be Disappearing Soon

OpenAI announced at its "Intelligence at Work" event that its coding assistant, Codex, will be fully integrated into the ChatGPT app within weeks. This move marks a strategic shift from a conversational AI (Chat) towards a unified "agentic" platform capable of execution. Codex, originally launched to compete with Anthropic's Claude Code, has grown rapidly to 5 million weekly active users, with 20% being non-developers like analysts and designers. Its enterprise revenue now constitutes 40% of OpenAI's total. The integration is the first step in creating a super-app combining ChatGPT (interface), Codex (execution engine), and the Atlas browser (web access). OpenAI also unveiled new Codex features: specialized Agent plugins for six professional roles, an "Annotations" tool for direct document editing, and a "Sites" function to turn work into shareable web apps. Internally, this reflects a power shift; the Codex team now leads core product strategy. While the ChatGPT brand remains for its vast user base, the platform's future is focused on autonomous agents that perform tasks, not just chat. The article notes that competition with Claude Code pushed OpenAI's development, with Codex competing on cost-effectiveness and accessibility rather than raw coding quality. It concludes that the essence of "ChatGPT" is evolving from a chatbot into an AI agent platform, with the name potentially becoming a legacy symbol of its original function.

marsbitHá 1h

ChatGPT Might Be Disappearing Soon

marsbitHá 1h

Trading

Spot
Futuros

Artigos em Destaque

Como comprar ERA

Bem-vindo à HTX.com!Tornámos a compra de Caldera (ERA) simples e conveniente.Segue o nosso guia passo a passo para iniciar a tua jornada no mundo das criptos.Passo 1: cria a tua conta HTXUtiliza o teu e-mail ou número de telefone para te inscreveres numa conta gratuita na HTX.Desfruta de um processo de inscrição sem complicações e desbloqueia todas as funcionalidades.Obter a minha contaPasso 2: vai para Comprar Cripto e escolhe o teu método de pagamentoCartão de crédito/débito: usa o teu visa ou mastercard para comprar Caldera (ERA) instantaneamente.Saldo: usa os fundos da tua conta HTX para transacionar sem problemas.Terceiros: adicionamos métodos de pagamento populares, como Google Pay e Apple Pay, para aumentar a conveniência.P2P: transaciona diretamente com outros utilizadores na HTX.Mercado de balcão (OTC): oferecemos serviços personalizados e taxas de câmbio competitivas para os traders.Passo 3: armazena teu Caldera (ERA)Depois de comprar o teu Caldera (ERA), armazena-o na tua conta HTX.Alternativamente, podes enviá-lo para outro lugar através de transferência blockchain ou usá-lo para transacionar outras criptomoedas.Passo 4: transaciona Caldera (ERA)Transaciona facilmente Caldera (ERA) no mercado à vista da HTX.Acede simplesmente à tua conta, seleciona o teu par de trading, executa as tuas transações e monitoriza em tempo real.Oferecemos uma experiência de fácil utilização tanto para principiantes como para traders experientes.

471 Visualizações TotaisPublicado em {updateTime}Atualizado em 2026.06.02

Como comprar ERA

Discussões

Bem-vindo à Comunidade HTX. Aqui, pode manter-se informado sobre os mais recentes desenvolvimentos da plataforma e obter acesso a análises profissionais de mercado. As opiniões dos utilizadores sobre o preço de ERA (ERA) são apresentadas abaixo.

活动图片