Auto Research Era: 47 Tasks Without Standard Answers Become the Must-Test Leaderboard for Agent Capabilities

marsbit | Published 2026-05-13 | Last updated 2026-05-13

Summary

The article introduces Frontier-Eng Bench, a new benchmark for AI agents developed by Einsia AI's Navers lab. Unlike traditional tests with clear answers, this benchmark presents 47 complex, real-world engineering tasks—such as optimizing underwater robot stability, battery fast-charging protocols, or quantum circuit noise control—where there is no single correct solution, only continuous optimization towards a limit. It shifts AI evaluation from static knowledge retrieval to a dynamic "engineering closed-loop": the AI must propose solutions, run simulations, interpret errors, adjust parameters, and re-run experiments to iteratively improve performance. This process tests an agent's ability to learn and evolve through long-term feedback, much like a human engineer tackling trade-offs between power, safety, and performance. Key findings from the benchmark reveal two patterns: 1) Improvements follow a power-law decay, becoming harder and smaller as optimization progresses, and 2) While exploring multiple solution paths (breadth) helps, sustained depth in a single path is crucial for breakthrough innovations. The research suggests this marks a step toward "Auto Research," where AI systems can autonomously conduct continuous, tireless optimization in scientific and engineering domains. Humans would set high-level goals, while AI agents handle the iterative experimentation and refinement. This could fundamentally change research and development workflows.

If we throw AI into an engineering site with no standard answers, can it still survive?

For a long time, AI Agents have appeared omnipotent, but in reality, most are just 'flipping through memories' within known knowledge bases.

Yet the real engineering world is harsh: the stability of underwater robots, the lithium plating boundary of power batteries, the noise control of quantum circuits... These problems have no 'perfect score', only 'optimizations that inch closer to the limit'.

Recently, the Agent Benchmark released by Navers lab under Einsia AI—Frontier-Eng Bench—officially tore off the label of AI being an 'exam-crammer'.

The research team didn't have AI grind through outdated coding problems. Instead, they gave it a complete 'engineering closed loop': propose a solution, connect to the simulator, digest errors, adjust parameters, and re-run.

Faced with 47 hardcore tasks spanning multiple disciplines, AI must behave like a senior engineer, seeking the optimal solution within the 'impossible triangle' of power consumption, safety, and performance.

This is not just a test suite; it's more like a rehearsal for Agent 'evolution'.

When AI begins to learn self-correction from feedback, the Auto Research era, where 'humans set goals and AI iterates non-stop 24/7', might be closer than we imagine.

AI Starts Tackling 'Hard Work'

Past large language models were more like super straight-A students.

You pose a question; it 'flips through memory' of its massive training data, then pieces together an answer that seems plausible.

In this mode, the large model is essentially playing 'word chain', not solving real-world problems.

But the emergence of Frontier-Eng Bench has AI doing the work of 'engineering optimization'.

The process now has the AI first propose a solution, then connect to a simulator to run experiments, collect feedback and errors, modify parameters and code, and re-run until performance improves further.
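The loop described here can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `simulate`, `propose`, and `optimize` are hypothetical names, the two-parameter quadratic objective is a toy stand-in for a real simulator, and random perturbation stands in for whatever strategy an actual agent would use to revise its solution.

```python
import random

def simulate(params):
    """Hypothetical stand-in for a simulator: returns (score, error_message)."""
    gain, damping = params["gain"], params["damping"]
    if damping <= 0:
        return None, "damping must be positive"
    # Toy objective with its peak at gain=2.0, damping=0.5 (higher is better).
    score = -((gain - 2.0) ** 2) - (damping - 0.5) ** 2
    return score, None

def propose(best_params, rng, step=0.3):
    """Perturb the best-known parameters to get a new candidate."""
    return {k: v + rng.uniform(-step, step) for k, v in best_params.items()}

def optimize(initial, iterations=200, seed=0):
    """Propose, simulate, read feedback, keep the best, and re-run."""
    rng = random.Random(seed)
    best_params = dict(initial)
    best_score, error = simulate(best_params)
    if error is not None:
        raise ValueError(f"initial candidate rejected: {error}")
    history = [best_score]
    for _ in range(iterations):
        candidate = propose(best_params, rng)
        score, error = simulate(candidate)
        if error is None and score > best_score:
            best_params, best_score = candidate, score
        # On an error a real agent would read the message and adapt;
        # this sketch simply keeps the previous best.
        history.append(best_score)
    return best_params, best_score, history

best_params, best_score, history = optimize({"gain": 0.0, "damping": 1.0})
print(best_params, best_score)
```

The `history` list records the best score after every round, which is the kind of trace one would inspect to see whether an agent keeps improving or stalls.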

In this closed-loop system, AI's identity undergoes a qualitative change.

Want to make the underwater robot more stable? AI must start automatically tuning the controller.

Want to increase the speed of the robotic arm a bit more? AI has to run simulations itself.

To some extent, AIs have shed their purely semantic understanding role and begun to act like professional engineers, continuously optimizing based on real-world environmental feedback.


The most interesting aspect of Frontier-Eng Bench is: it doesn't test whether AI 'answered correctly', but rather whether AI can continuously become stronger.

Real engineering optimization is never a multiple-choice question; there is no single standard answer.

Take fast-charging batteries as an example: the goal sounds simple—charge as fast as possible, but reality isn't so easy.

Under strict constraints, where temperature must not spike, voltage must not exceed its limit, battery life must not degrade too fast, and lithium plating must be avoided, AI must find the precise balance point of performance.

This means AI cannot pass with clever 'test-cramming' tricks; it must demonstrate endurance for continuous evolution through long-term feedback.
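One way to picture such a task is as a score that only counts when every constraint holds. The sketch below is an assumption-laden illustration, not the benchmark's actual scoring: the `ChargeResult` fields, the limit values, and the `score` function are all hypothetical, and the battery-life constraint is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class ChargeResult:
    charge_time_s: float    # time to reach the target state of charge
    peak_temp_c: float      # highest cell temperature during the run
    peak_voltage_v: float   # highest terminal voltage during the run
    plating_detected: bool  # whether the simulator flagged lithium plating

# Illustrative limits only; not taken from the benchmark.
MAX_TEMP_C = 45.0
MAX_VOLTAGE_V = 4.2

def violations(result):
    """List the names of all constraints this run breaks."""
    found = []
    if result.peak_temp_c > MAX_TEMP_C:
        found.append("temperature")
    if result.peak_voltage_v > MAX_VOLTAGE_V:
        found.append("voltage")
    if result.plating_detected:
        found.append("lithium plating")
    return found

def score(result):
    """Return (value, broken). Infeasible runs get no value; feasible runs score by speed."""
    broken = violations(result)
    if broken:
        return None, broken
    return -result.charge_time_s, broken  # faster charge gives a higher score

safe = ChargeResult(1500.0, 41.0, 4.18, False)
hot = ChargeResult(1100.0, 52.0, 4.19, False)
print(score(safe))
print(score(hot))
```

In this toy setup the faster protocol loses because it overheats, which is the trade-off the fast-charging example describes: speed is rewarded only inside the safe region.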

Can AI perform long-term optimization in real environments?

Looking at the results, GPT5.4 showed the most stable overall performance, but AIs still have a long way to go before 'solving' the Benchmark.


Auto Research Enters the 'Iterative Optimization' Era

The research team raised a very interesting point in their paper:

Truly advanced intelligence essentially relies on long-term feedback loops.

AlphaGo's victory over Lee Sedol came from the vast number of simulations and the immediate feedback behind each decision, not from rote memorization of established game records.

True scientific research is the same: top labs don't rely on a single burst of inspiration, but continuously propose hypotheses, run experiments, examine results, modify plans, and try again.

Engineering optimization follows the same principle: anyone can create the first version; what's truly difficult is that final 1% performance leap.

The significance of Frontier-Eng Bench lies here: For the first time, it systematically begins testing AI's 'iterative optimization capability', and has summarized two nearly brutal laws of AI evolution.


The first law is: The further you go, the harder the improvement.

The paper found that both the frequency and the magnitude of Agent improvements follow a power-law decay:

Improvement frequency ∝ 1 / iteration count
Improvement magnitude ∝ 1 / improvement count

Simply put: the fastest gains come in the first few rounds, and it gets progressively harder and smaller later on.
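To get a feel for what these proportionalities imply, one can read them literally and add things up. The sketch below assumes, purely for illustration, that the chance of an improvement at iteration t is c / t and that the k-th improvement has size first_gain / k; the constants and function names are hypothetical and are not taken from the paper.

```python
import math

def harmonic(n):
    """Sum of 1/k for k = 1..n."""
    return sum(1.0 / k for k in range(1, n + 1))

def expected_improvements(iterations, c=1.0):
    """Expected improvement count if P(improve at iteration t) = min(1, c / t)."""
    return sum(min(1.0, c / t) for t in range(1, iterations + 1))

def cumulative_gain(improvements, first_gain=1.0):
    """Total gain if the k-th improvement has magnitude first_gain / k."""
    return first_gain * harmonic(improvements)

for iterations in (10, 100, 1000):
    print(iterations, round(expected_improvements(iterations), 2))
```

Under these assumptions both quantities grow only logarithmically: with c = 1, going from 100 to 1000 iterations adds about ln(10), roughly 2.3 expected improvements, about the same as going from 10 to 100. Each tenfold increase in effort buys roughly the same small increment.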

This closely resembles the real R&D process: the first version can quickly pick off the 'low-hanging fruit', but the closer it gets to the bottleneck, the more effort is required to squeeze out even a bit more performance.

Would it be more cost-effective to explore multiple paths in parallel for trial and error? The answer lies in the second law.


The second law: Breadth is useful, but depth is even more indispensable.

Running multiple parallel paths can avoid getting stuck, but with a fixed budget, each additional chain reduces the depth of exploration available to every chain.

Many engineering breakthroughs require continuous accumulation and constant correction before structural leaps emerge; they can't be achieved simply by 'trying a few more times'.
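The trade-off can be made concrete with simple arithmetic. The sketch below assumes an even split of a fixed evaluation budget across chains and a hypothetical threshold, `MIN_DEPTH`, for how many consecutive steps a structural change needs; both numbers are invented for illustration and do not come from the paper.

```python
def allocate(budget, chains):
    """Split a fixed evaluation budget evenly; return the steps available to each chain."""
    if not 1 <= chains <= budget:
        raise ValueError("chains must be between 1 and budget")
    return budget // chains

def chains_reaching(budget, chains, min_depth):
    """Count the chains deep enough to reach a change that needs min_depth steps."""
    return chains if allocate(budget, chains) >= min_depth else 0

BUDGET, MIN_DEPTH = 64, 20  # hypothetical numbers for illustration
for chains in (1, 2, 4, 8):
    print(chains, allocate(BUDGET, chains), chains_reaching(BUDGET, chains, MIN_DEPTH))
```

With these numbers, two chains still get 32 steps each and both can reach the threshold, while four or more chains each fall short of it. Breadth helps only as long as every chain keeps enough depth.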

This actually points towards the development direction of next-generation Agents: not models that 'output an answer once', but systems that can continuously iterate and self-evolve within long-term feedback loops.

AI Engineers Might Really Be Coming

The deeper significance of this research is that it gives a first outline of an AI system that begins to approach the real engineering cycle.


Imagine when AI connects to industrial software, simulation environments, CAD systems, chip design tools, scientific computing platforms...

A dramatic shift in how productive work gets done may be close.

In future labs, a division of labor like this might appear:

Human researchers are responsible for proposing directions and goals.

For example, 'reduce this component's energy consumption by 30%', 'compress this model's forward pass GPU usage even lower', 'increase the stability of robot control a bit more', 'push the fidelity of this quantum circuit closer to the limit', etc.

And AI is responsible for 'grinding the path': it stays focused on these goals and keeps optimizing.

For example, automatically running simulations and experiments, automatically reading feedback from verifiers and simulators, then continuing to modify and optimize, iterating non-stop 24/7.

This evolutionary logic frees AI from the identity of an 'assistive tool', allowing it to begin solving complex system problems like a real engineering team—and tirelessly at that.

And the question raised by the Frontier-Eng Benchmark is actually very direct:

When AI begins to learn 'long-term optimization', how far is it from true engineering intelligence?

Paper Title: Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Project Homepage: https://lab.einsia.ai/frontier-eng/

Arxiv: https://arxiv.org/abs/2604.12290

GitHub repo: https://github.com/EinsiaLab/Frontier-Engineering

This article is from the WeChat public account "Quantum Bit", author: Yun Zhong


Related Questions

Q: What is the main purpose of the Frontier-Eng Benchmark released by Einsia AI's Navers lab?

A: The main purpose of the Frontier-Eng Benchmark is to move beyond testing AI's ability to recall known information. It systematically tests AI agents' capability for 'iterative optimization' on 47 real-world, open-ended engineering tasks without standard answers, evaluating if they can continuously improve performance through a feedback loop involving simulation, error analysis, and parameter adjustment.

Q: How does the AI's role change in the Frontier-Eng Benchmark testing process compared to traditional language models?

A: In the Frontier-Eng Benchmark, the AI transitions from acting as a 'super student' that retrieves and assembles answers from training data to performing 'engineering optimization.' Its role becomes akin to a professional engineer: it proposes solutions, runs simulations, analyzes feedback and errors, modifies parameters/code, and reruns experiments in a continuous loop to seek optimal performance under complex constraints.

Q: What are the two key 'AI evolution laws' discovered through the Frontier-Eng Benchmark regarding iterative optimization?

A: The two key laws are: 1) Improvements become progressively harder and smaller (showing a power-law decay: Improvement frequency ∝ 1/iteration count, Improvement magnitude ∝ 1/improvement count). 2) While exploring multiple parallel paths (breadth) is useful, sustained depth in a single optimization path is more critical for achieving structural breakthroughs, as fixed budgets force a trade-off between breadth and depth.

Q: What future work paradigm does the article suggest might emerge from the development of self-evolving AI agents?

A: The article suggests a future 'Auto Research' paradigm where human researchers define the goals and direction (e.g., 'reduce component energy consumption by 30%'), and AI agents take on the role of 'grinding the path.' They would work autonomously and tirelessly around the clock, running simulations, interpreting feedback from verifiers and simulators, and iteratively optimizing to approach performance limits.

Q: According to the article, what fundamental shift in AI capability does the Frontier-Eng Benchmark represent?

A: The Frontier-Eng Benchmark represents a fundamental shift from evaluating AI's ability to find predetermined 'correct answers' to testing its capacity for 'self-evolution' through long-term feedback loops. It moves the focus to whether AI can demonstrate sustained learning and improvement in complex, real-world scenarios with no single correct answer, pushing AI closer to genuine engineering intelligence.
