Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbit | Published on 2026-05-28 | Last updated on 2026-05-28

Abstract

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the secret agreement between OpenAI and Microsoft that was later exposed, the yardstick is financial: an AI system capable of generating at least $100 billion in profit qualifies as AGI. For Jensen Huang, the yardstick is time: AGI is inevitable within five years. Musk has repeatedly predicted that it will be achieved "next year."

Industry leaders talk past one another not because anyone is lying, but because AGI itself has no universally accepted yardstick. Bennett, an independent-minded researcher in the AGI field, notes in his paper that hype and speculation have reduced AGI to a "Rorschach inkblot test": everyone sees only their own imagination, not objective facts. Santa Fe Institute scientist Melanie Mitchell likewise believes the debate can only be settled through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

In 2025, Who Is Redrawing the Starting Line for AGI?

Facing this definition vacuum, academia began intensively "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to break the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, offered an answer that is both highly provocative and unusually incisive.

He pointed out that previous definitions, for all their circling, still measure against the benchmark of an "educated adult." Bennett instead adopted scholar Pei Wang's definition of intelligence as adaptive capability under limited resources, stepping outside the "human-like" framework altogether and defining AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.
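The 9.11 versus 9.9 slip can be reproduced with a deliberately flawed comparison rule. The sketch below is only an analogy and makes no claim about how a large model computes its answer. The function names are hypothetical. It contrasts a rule that treats the digits after the dot as a whole integer (the "11 is greater than 9" reasoning) with a genuine numeric comparison.

```python
from decimal import Decimal


def compare_like_version_numbers(a: str, b: str) -> str:
    """Flawed rule (illustrative assumption): compare the digits after
    the dot as whole integers, the way software versions are compared."""
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    key_a = (int(a_int), int(a_frac))
    key_b = (int(b_int), int(b_frac))
    return a if key_a > key_b else b


def compare_as_decimals(a: str, b: str) -> str:
    """Numeric rule: parse both strings as decimal numbers, then compare."""
    return a if Decimal(a) > Decimal(b) else b


if __name__ == "__main__":
    # The version-style rule repeats the mistake described above.
    print(compare_like_version_numbers("9.11", "9.9"))  # 9.11
    # The numeric rule gives the mathematically correct answer.
    print(compare_as_decimals("9.11", "9.9"))           # 9.9
```

Both rules are consistent with plenty of text that a model might have seen (version numbers, section numbers, decimals), which is one way to read the article's point that pattern matching alone does not determine which rule applies.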

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route, using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. Once they encounter out-of-distribution problems, they immediately falter.

More seriously, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or make trade-offs between "continued exploration" and "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

Bennett's criteria deserve attention because they bring AGI down from a vague philosophical proposition to a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know that "rain" often accompanies "wet ground," but not which causes which. Only a system that understands causality can infer, when the sky is clear but the ground is wet, that a sprinkler truck has passed by rather than that rain is imminent. Without causal understanding, AI can only operate within the distribution of its training data, which has nothing to do with being "general."
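The gap between "rain accompanies wet ground" and "rain causes wet ground" can be made concrete with a toy calculation. The sketch below assumes a two-variable model in which rain causes wet ground. Every probability in it is a made-up illustration, not a figure from the paper. It compares what we learn about rain when we observe wet ground with what we learn when we wet the ground ourselves (an intervention).

```python
from fractions import Fraction

# Illustrative assumptions only: a tiny causal model, rain -> wet ground.
P_RAIN = Fraction(1, 5)             # prior probability of rain
P_WET_GIVEN_RAIN = Fraction(9, 10)  # ground is wet when it rains
P_WET_GIVEN_DRY = Fraction(1, 10)   # wet anyway (e.g. a sprinkler truck)


def p_rain_given_wet_observed() -> Fraction:
    """Observation: seeing wet ground is evidence for rain (Bayes' rule)."""
    joint_rain_wet = P_RAIN * P_WET_GIVEN_RAIN
    joint_dry_wet = (1 - P_RAIN) * P_WET_GIVEN_DRY
    return joint_rain_wet / (joint_rain_wet + joint_dry_wet)


def p_rain_given_wet_forced() -> Fraction:
    """Intervention: if we pour water on the ground ourselves, the arrow
    from rain to wet ground is cut, so rain keeps its prior probability."""
    return P_RAIN


if __name__ == "__main__":
    print(p_rain_given_wet_observed())  # 9/13
    print(p_rain_given_wet_forced())    # 1/5
```

A learner that only tracks the correlation would report 9/13 in both cases. Telling the two apart requires knowing the direction of the arrow, which is the "knowing why" that the article describes.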

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

A system that only explores without exploiting cannot solve immediate problems despite mastering vast knowledge; one that only exploits without exploring is helpless when the environment changes. AGI must dynamically balance this tension under resource constraints, knowing what it doesn't know and allocating computing power accordingly.

Furthermore, Bennett added a highly realistic dimension: energy constraints. By including "energy" in the definition, he draws a clear line: true intelligence is not about possessing unlimited resources, but about adapting elegantly under limited ones. An AI that needs the output of an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.
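The exploration and exploitation trade-off is commonly illustrated with a multi-armed bandit. The sketch below uses epsilon-greedy, a standard textbook strategy that is swapped in here for illustration and is not a method from Bennett's paper. The function name, the reward probabilities, and the use of `steps` as a stand-in for a resource budget are all assumptions.

```python
import random


def epsilon_greedy(true_means, steps, epsilon, seed=0):
    """Run an epsilon-greedy agent on a Bernoulli bandit.

    true_means: hidden success probability of each arm (assumed values).
    steps:      number of pulls allowed, standing in for a resource budget.
    epsilon:    probability of exploring a random arm on each step.
    Returns (pull counts per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try any arm
        else:
            # exploit: pick the arm with the best estimate so far
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        # incremental update of the running mean for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, total_reward


if __name__ == "__main__":
    arms = [0.0, 1.0]  # arm 1 always pays, arm 0 never does
    print(epsilon_greedy(arms, steps=1000, epsilon=0.0))  # never explores
    print(epsilon_greedy(arms, steps=1000, epsilon=1.0))  # never exploits
    print(epsilon_greedy(arms, steps=1000, epsilon=0.1))  # mixes both
```

With `epsilon=0.0` the agent never leaves arm 0 and earns nothing, which mirrors the "only exploits" failure. With `epsilon=1.0` it keeps sampling the worthless arm even after the good one is obvious. A small nonzero `epsilon` usually finds the paying arm quickly and then stays with it.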

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett deconstructs the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.

Bennett's conclusion is extremely clear: although Scale-maxing currently dominates absolutely, AGI will never be achieved through the brute-force aesthetics of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will completely change. We no longer need to see how many more points a large model surpasses on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core is no longer "how much you know," but "how much you can discover."
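As a rough idea of what "discovering patterns with limited interaction" could look like in code, here is a minimal sketch of a toy adaptability test. It is not a benchmark from the paper. The environment, the agents, and all names are hypothetical. The hidden "law" is a single integer threshold, and each query to the environment costs one unit of a fixed interaction budget.

```python
class HiddenThresholdEnv:
    """Toy unseen environment: query(x) reveals whether x >= a hidden threshold."""

    def __init__(self, threshold: int, budget: int):
        self._threshold = threshold
        self.remaining = budget  # interactions the agent may still spend

    def query(self, x: int) -> bool:
        if self.remaining <= 0:
            raise RuntimeError("interaction budget exhausted")
        self.remaining -= 1
        return x >= self._threshold

    def score(self, guess: int) -> bool:
        """The agent passes only if it recovered the hidden rule exactly."""
        return guess == self._threshold


def active_agent(env: HiddenThresholdEnv, low: int, high: int) -> int:
    """Chooses its own experiments: bisects [low, high] while budget remains."""
    while low < high and env.remaining > 0:
        mid = (low + high) // 2
        if env.query(mid):
            high = mid       # threshold is at or below mid
        else:
            low = mid + 1    # threshold is above mid
    return low


def passive_agent(env: HiddenThresholdEnv, handed_points) -> int:
    """Only inspects points chosen by someone else, in the order given."""
    positives = [x for x in handed_points if env.remaining > 0 and env.query(x)]
    return min(positives) if positives else max(handed_points) + 1


if __name__ == "__main__":
    active_env = HiddenThresholdEnv(threshold=737, budget=10)
    print(active_env.score(active_agent(active_env, 0, 1023)))  # True

    passive_env = HiddenThresholdEnv(threshold=737, budget=10)
    grid = list(range(0, 1024, 100))
    print(passive_env.score(passive_agent(passive_env, grid)))  # False
```

Both agents get the same ten interactions. The one that plans its own queries pins down the threshold exactly, while the one fed a fixed grid can only narrow it to the nearest grid point. A real adaptability benchmark would need far richer environments, but the scoring idea (what was discovered per unit of interaction) is the same.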

Technical routes will shift accordingly. Relying on the Scaling Law alone will soon hit a ceiling, because passively received data cannot supply causal understanding. Search and approximation, scale maximization and constraint weakening: achieving AGI will necessarily require a fusion of tools and meta-methods, not the extension of a single route.

The importance of Bennett's paper lies not in providing the ultimate answer to AGI, but in wiping clean a corner of the blurred mirror called "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversation that sounds ever more human, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist.

(This article was first published on Titanium Media APP, Author | Silicon Valley Tech News, Editor | Zhao Hongyu)

Related Questions

Q: What is the core argument of the article regarding current AI models and AGI?

A: The article argues that current large AI models, despite excelling in exams, are getting further from true AGI because they rely on large-scale data approximation rather than genuine understanding. They lack the core abilities of adaptation, causal reasoning, and active experimentation needed for true intelligence.

Q: According to the paper by Michael Timothy Bennett discussed in the article, what is a proposed new definition for AGI?

A: Michael Timothy Bennett proposes defining AGI as an 'artificial scientist': a system that can adapt widely, efficiently, and scientifically to new environments and tasks under real-world constraints (like computation, memory, and energy), just like a human scientist. The focus is on the ability to discover new knowledge, not just mimic humans.

Q: What are the three key behavioral shifts that the 'artificial scientist' framework suggests an AGI must achieve, as outlined in the article?

A: The three key shifts are: 1) From 'passive learner' to 'active experimenter' (planning and executing experiments to gather information). 2) From 'knowing that' to 'knowing why' (understanding causal relationships, not just correlations). 3) Mastering the 'exploration-exploitation trade-off' (dynamically balancing the use of known knowledge and the search for new knowledge under resource constraints).

Q: What are the three 'meta-methods' for building intelligent systems that the article describes, and what is the proposed path to AGI?

A: The three meta-methods are: 1) Scale-maxing (maximizing data, parameters, compute). 2) Simp-maxing (maximizing architectural simplicity). 3) W-maxing (maximizing constraint weakening to find optimal solutions). The article argues that AGI will not be achieved through a single meta-method like Scale-maxing alone, but rather through the integration of multiple, diverse approaches.

Q: How would the acceptance of the 'artificial scientist' definition change how we evaluate AI systems, according to the article?

A: The focus would shift from evaluating based on performance on human-centric benchmarks and exams to creating new 'adaptability benchmarks.' These would test an AI's ability to discover patterns in unseen physical environments, learn new game rules faster than humans, or solve real scientific problems by autonomously forming and testing hypotheses. The core metric becomes 'how much can you discover,' not 'how much do you know.'
