Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbit | Published on 2026-05-28 | Last updated on 2026-05-28

Abstract

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the secret agreement between OpenAI and Microsoft that was later exposed, the yardstick is a financial statement: an AI system capable of generating at least $100 billion in profit qualifies as AGI. For Jensen Huang, the yardstick is time: AGI is inevitable within five years. Musk has repeatedly predicted "achieving it next year."

Industry leaders speak different languages not because anyone is lying, but because AGI itself has no universally accepted yardstick. As Bennett, an independent-minded researcher in the AGI field, notes in his paper, hype and speculation have reduced AGI to a "Rorschach inkblot test": everyone sees only their own imagination, not objective facts. Santa Fe Institute scientist Melanie Mitchell likewise believes this debate can only be settled through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

2025, Who is Redrawing the Starting Line for AGI?

Facing this definition vacuum, academia began intensively "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to break the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, offered a provocative and incisive answer.

He pointed out that previous definitions, for all their variations, still measure AI against the benchmark of an "educated adult." Bennett instead adopted scholar Pei Wang's definition of intelligence, which treats intelligence as the capacity to adapt under limited resources. This steps outside the "human-like" framework entirely and defines AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.
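The mistake described above is easy to reproduce in code. The following is a minimal sketch and not a description of how any model works internally: `compare_like_versions` is a hypothetical helper that compares the digits after the decimal point as whole numbers, the way software version numbers are ordered, while `compare_as_decimals` compares the actual numeric values.

```python
from decimal import Decimal


def compare_as_decimals(a: str, b: str) -> str:
    """Return the larger of two decimal strings by numeric value."""
    return a if Decimal(a) > Decimal(b) else b


def compare_like_versions(a: str, b: str) -> str:
    """Flawed heuristic: treat 'x.y' as a pair of integers (x, y).

    This is how version numbers are ordered, and it is wrong for decimals.
    """
    def key(s: str) -> tuple[int, int]:
        whole, frac = s.split(".")
        return int(whole), int(frac)

    return a if key(a) > key(b) else b


print(compare_as_decimals("9.11", "9.9"))    # 9.9
print(compare_like_versions("9.11", "9.9"))  # 9.11
```

Both functions are internally consistent; the second simply applies the wrong rule to the question being asked, which is the kind of error the example in the text points to.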

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route, using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. Once they encounter out-of-distribution problems, they falter.

More seriously, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or make trade-offs between "continued exploration" and "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

Bennett's criteria deserve attention because he brought AGI down from a vague philosophical proposition to a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know "rain" often accompanies "wet ground," but don't know which causes which. Only by understanding causality can a system infer, when the sky is clear but the ground is wet, that a sprinkler truck passed by rather than that it has rained. Without causal understanding, AI can only operate within the distribution of its training data, which has nothing to do with being "general."
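The gap between correlation and causation in the rain example can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions: the causal structure (rain and a sprinkler each independently wet the ground) and the base rates `P_RAIN` and `P_SPRINKLER` are invented for illustration and do not come from Bennett's paper. The sketch compares what can be inferred about rain when wet ground is merely observed with what can be inferred when someone wets the ground on purpose.

```python
import random

P_RAIN = 0.3       # assumed base rate, for illustration only
P_SPRINKLER = 0.2  # assumed base rate, for illustration only


def sample_day(rng, force_wet=None):
    """Draw one day from a toy causal model: rain -> wet <- sprinkler.

    force_wet=True models an intervention (someone hoses down the ground),
    which sets `wet` directly without touching its causes.
    """
    rain = rng.random() < P_RAIN
    sprinkler = rng.random() < P_SPRINKLER
    wet = (rain or sprinkler) if force_wet is None else force_wet
    return rain, sprinkler, wet


def simulate(n, seed, force_wet=None):
    rng = random.Random(seed)
    return [sample_day(rng, force_wet) for _ in range(n)]


def rain_rate_when_wet(days):
    """Fraction of wet-ground days on which it actually rained."""
    rained = [rain for rain, _, wet in days if wet]
    return sum(rained) / len(rained)


observed_days = simulate(100_000, seed=0)
forced_days = simulate(100_000, seed=0, force_wet=True)

observed = rain_rate_when_wet(observed_days)   # analytically 0.3 / 0.44
intervened = rain_rate_when_wet(forced_days)   # analytically 0.3
# Clear sky but wet ground: in this model the sprinkler is the only explanation.
clear_but_wet = [s for rain, s, wet in observed_days if wet and not rain]

print(f"P(rain | ground observed wet) ~ {observed:.2f}")
print(f"P(rain | ground made wet)     ~ {intervened:.2f}")
print("sprinkler explains every clear-sky wet day:", all(clear_but_wet))
```

Observed wet ground raises the probability of rain well above its base rate, while wetting the ground by hand tells you nothing about rain. A learner that only tracks the correlation cannot tell these two cases apart.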

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

A system that only explores without exploiting cannot solve immediate problems despite mastering vast knowledge; one that only exploits without exploring is helpless when the environment changes. AGI must dynamically balance this contradiction under resource constraints, knowing what it doesn't know and allocating computing power accordingly.
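The trade-off above can be illustrated with epsilon-greedy, a standard textbook strategy for the multi-armed bandit problem. This is a minimal sketch and not a method from Bennett's paper: the payoff probabilities in `ARMS`, the step count, and the function name `run_bandit` are all assumptions chosen for illustration.

```python
import random


def run_bandit(true_means, epsilon, steps, seed):
    """Epsilon-greedy on a Bernoulli multi-armed bandit.

    With probability `epsilon` the agent explores (pulls a random arm);
    otherwise it exploits the arm with the best estimated payoff so far.
    Returns (total_reward, pull_counts).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        # Incremental running mean of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total, counts


ARMS = [0.2, 0.5, 0.8]  # hypothetical payoff probabilities
results = {eps: run_bandit(ARMS, eps, steps=5000, seed=0) for eps in (0.0, 0.1, 1.0)}
for eps, (total, counts) in results.items():
    print(f"epsilon={eps}: reward={total}, pulls={counts}")
```

With `epsilon=0.0` the agent never leaves the first arm it tries, and with `epsilon=1.0` it keeps sampling at random even after the best arm is obvious. A small nonzero `epsilon` typically earns the most reward, because it spends a little of its budget finding out what it does not know.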

Furthermore, Bennett added a highly realistic dimension: energy constraints. Including "energy" in the definition means he draws a clear bottom line: true intelligence is not about possessing unlimited resources, but about elegantly adapting under limited resources. An AI that requires consuming an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett deconstructs the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.

Bennett's conclusion is extremely clear: although Scale-maxing currently dominates absolutely, AGI will never be achieved through the brute-force aesthetics of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will completely change. We no longer need to see how many more points a large model surpasses on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core is no longer "how much you know," but "how much you can discover."
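To show what "discovering patterns with limited interaction" might look like as a test harness, here is a toy sketch. Everything in it is hypothetical and chosen for illustration: the class `HiddenLinearWorld`, the hidden rule `y = a*x + b`, the budget of two experiments, and both agents. It is not taken from Bennett's paper or from any existing benchmark.

```python
class HiddenLinearWorld:
    """Toy environment with an unknown rule y = a*x + b and a query budget."""

    def __init__(self, a, b, budget):
        self._a, self._b = a, b
        self.budget = budget
        self.queries_used = 0

    def experiment(self, x):
        """Run one experiment; each call consumes one unit of budget."""
        if self.queries_used >= self.budget:
            raise RuntimeError("interaction budget exhausted")
        self.queries_used += 1
        return self._a * x + self._b

    def score(self, predict, test_inputs):
        """Fraction of held-out inputs the agent's rule predicts correctly."""
        hits = sum(predict(x) == self._a * x + self._b for x in test_inputs)
        return hits / len(test_inputs)


def active_agent(world):
    """Chooses two informative experiments, then infers slope and intercept."""
    y0 = world.experiment(0)
    y1 = world.experiment(1)
    slope, intercept = y1 - y0, y0
    return lambda x: slope * x + intercept


def memorizing_agent(world):
    """Stores observations verbatim and has no answer for unseen inputs."""
    seen = {x: world.experiment(x) for x in (0, 1)}
    return lambda x: seen.get(x)


HELD_OUT = [10, 20, 30]
for agent in (active_agent, memorizing_agent):
    world = HiddenLinearWorld(a=3, b=-2, budget=2)
    rule = agent(world)
    print(agent.__name__, world.score(rule, HELD_OUT), world.queries_used)
```

Both agents spend the same two experiments, but only the one that forms a hypothesis about the underlying rule scores on inputs it has never seen. A real adaptability benchmark would need far richer environments; the point of the sketch is only that the score depends on what is discovered per unit of interaction, not on how much was stored in advance.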

Technical routes will also shift accordingly. The simple Scaling Law will soon hit its ceiling because passively received data cannot supply causal understanding. Search and approximation, scale maximization and constraint weakening: the achievement of AGI will necessarily be a fusion of various tools and meta-methods, not an extension of a single route.

The importance of Bennett's paper lies not in providing the ultimate answer to AGI, but in wiping clean a corner of the blurred mirror named "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversations that increasingly resemble humans, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist.

(This article was first published on Titanium Media APP, Author | Silicon Valley Tech News, Editor | Zhao Hongyu)
