Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbit | Published 2026-05-28 | Last updated 2026-05-28

Summary

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the leaked agreement between OpenAI and Microsoft, the yardstick is financial: an AI system capable of generating at least $100 billion in profit qualifies as AGI. For Jensen Huang, the yardstick is time: AGI is inevitable within five years. Musk has repeatedly predicted "achieving it next year."

Industry leaders talk past one another not because anyone is lying, but because there is no universally accepted yardstick for AGI itself. As Bennett, an independent-minded researcher in the AGI field, notes in his paper, hype and speculation have reduced AGI to a "Rorschach inkblot test": everyone sees only their own imagination, not objective facts. Santa Fe Institute scientist Melanie Mitchell likewise believes this debate can only be settled through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

In 2025, Who Is Redrawing the Starting Line for AGI?

Faced with this definitional vacuum, academia set about "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to dispel the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, offered a provocative and incisive answer.

He pointed out that previous definitions, for all their variations, still measure AI against the benchmark of an "educated adult." Bennett instead adopted scholar Pei Wang's definition of intelligence, which treats intelligence as the ability to adapt under limited resources. This breaks out of the "human-like" framework entirely and defines AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.
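The two readings of "9.11 versus 9.9" can be made concrete with a small sketch. The function names below are hypothetical, and the second function only mimics the flawed shortcut described above ("11 is greater than 9"); it is not a claim about what happens inside any particular model.

```python
from decimal import Decimal


def compare_as_decimals(a: str, b: str) -> str:
    """Return the larger of two numeric strings, compared by numeric value."""
    return a if Decimal(a) > Decimal(b) else b


def compare_as_version_parts(a: str, b: str) -> str:
    """Return the 'larger' string when each dot-separated part is read as an integer.

    This reproduces the shortcut "11 is greater than 9": the strings are
    compared as (9, 11) versus (9, 9), the way version numbers are compared.
    """
    parts_a = tuple(int(part) for part in a.split("."))
    parts_b = tuple(int(part) for part in b.split("."))
    return a if parts_a > parts_b else b


print(compare_as_decimals("9.11", "9.9"))       # 9.9
print(compare_as_version_parts("9.11", "9.9"))  # 9.11
```

Both answers are "correct" under their own rule. The mistake lies in applying the version-number rule to a question about decimal magnitude.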

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route, using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. Once they encounter out-of-distribution problems, they immediately falter.

More seriously, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or trade off "continued exploration" against "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

Bennett's criteria deserve attention because they turn AGI from a vague philosophical proposition into a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know that "rain" often accompanies "wet ground," but not which causes which. Only a system that understands causality can infer, when the sky is clear but the ground is wet, that a sprinkler truck has passed by rather than that rain is imminent. Without causal understanding, AI can only operate within the distribution of its training data, which is the opposite of being "general."
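The rain and sprinkler example can be sketched as a tiny hand-built causal model. Everything here is an assumption for illustration: the variable names, the structure (rain and a sprinkler truck each wet the ground, and rain makes a clear sky unlikely), and every probability are made up and do not come from Bennett's paper.

```python
from itertools import product

# All probabilities are made-up illustrative numbers, not taken from the paper.
P_RAIN = 0.3
P_SPRINKLER = 0.1
P_WET_IF_ANY_CAUSE = 0.95   # ground is wet if it rained or a sprinkler truck passed
P_WET_IF_NO_CAUSE = 0.02
P_CLEAR_IF_RAIN = 0.05      # a clear sky is unlikely when it has rained
P_CLEAR_IF_NO_RAIN = 0.8


def joint(rain: bool, sprinkler: bool, wet: bool, clear: bool) -> float:
    """Joint probability of one complete world under the assumed causal structure."""
    p = (P_RAIN if rain else 1 - P_RAIN) * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    p_wet = P_WET_IF_ANY_CAUSE if (rain or sprinkler) else P_WET_IF_NO_CAUSE
    p *= p_wet if wet else 1 - p_wet
    p_clear = P_CLEAR_IF_RAIN if rain else P_CLEAR_IF_NO_RAIN
    p *= p_clear if clear else 1 - p_clear
    return p


def posterior(target: str, evidence: dict) -> float:
    """P(target is True | evidence), computed by enumerating all 16 worlds."""
    names = ("rain", "sprinkler", "wet", "clear")
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


p_rain_given_wet = posterior("rain", {"wet": True})
p_rain_given_wet_clear = posterior("rain", {"wet": True, "clear": True})
p_sprinkler_given_wet_clear = posterior("sprinkler", {"wet": True, "clear": True})

print(f"P(rain | wet)                 = {p_rain_given_wet:.2f}")
print(f"P(rain | wet, clear sky)      = {p_rain_given_wet_clear:.2f}")
print(f"P(sprinkler | wet, clear sky) = {p_sprinkler_given_wet_clear:.2f}")
```

With these assumed numbers, wet ground alone points mostly to rain, which is the correlation a pattern-matcher would latch onto. Once the clear sky is taken into account, the sprinkler becomes the more likely explanation.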

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

A system that only explores and never exploits cannot solve immediate problems despite mastering vast knowledge; one that only exploits and never explores is helpless when the environment changes. AGI must dynamically balance this tension under resource constraints: knowing what it doesn't know and allocating computing power accordingly.
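A standard textbook way to illustrate this trade-off is the epsilon-greedy strategy on a multi-armed bandit, sketched below. This is not Bennett's method; the arm payoffs, the pull budget (standing in for limited resources), and the function name are assumptions chosen for illustration.

```python
import random


def run_bandit(arm_means, epsilon, budget, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit and return the total reward.

    epsilon is the probability of exploring (pulling a random arm);
    budget is the fixed number of pulls the agent is allowed.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    estimates = [0.0] * len(arm_means)
    total = 0
    for _ in range(budget):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_means))  # explore: try a random arm
        else:
            # exploit: pull the arm with the best estimate so far (ties go to the first arm)
            arm = max(range(len(arm_means)), key=lambda i: estimates[i])
        reward = 1 if rng.random() < arm_means[arm] else 0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        total += reward
    return total


ARM_MEANS = [0.2, 0.5, 0.8]  # hypothetical payoff probabilities, unknown to the agent
BUDGET = 2000

exploit_only = run_bandit(ARM_MEANS, epsilon=0.0, budget=BUDGET)
explore_only = run_bandit(ARM_MEANS, epsilon=1.0, budget=BUDGET)
balanced = run_bandit(ARM_MEANS, epsilon=0.1, budget=BUDGET)

print("exploit only:", exploit_only)
print("explore only:", explore_only)
print("balanced:    ", balanced)
```

In this toy setup the pure exploiter never leaves the first arm it tries, the pure explorer learns which arm is best but never acts on it, and the balanced agent collects the most reward within the same budget.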

Furthermore, Bennett added a highly realistic dimension: energy constraints. By including "energy" in the definition, he draws a clear bottom line: true intelligence is not about possessing unlimited resources, but about adapting elegantly under limited ones. An AI that needs the output of an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett deconstructs the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.

Bennett's conclusion is extremely clear: although Scale-maxing currently dominates absolutely, AGI will never be achieved through the brute-force aesthetics of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will change completely. We no longer need to track how many more points a large model scores on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core question is no longer "how much you know," but "how much you can discover."
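What such a benchmark might look like can be sketched in miniature. The toy below is purely hypothetical and is not taken from the paper: an environment hides a rule (a threshold), each agent gets a fixed interaction budget, and the score is how far its final guess is from the hidden rule. One agent chooses its own experiments, while the other only sees observations it did not choose.

```python
import random

LOW, HIGH = 0, 1023  # hypothetical range in which the hidden threshold lies


class HiddenThresholdEnv:
    """Toy environment: probing x returns True iff x >= a hidden threshold."""

    def __init__(self, threshold: int, budget: int):
        self._threshold = threshold
        self.remaining = budget  # interactions left

    def probe(self, x: int) -> bool:
        if self.remaining <= 0:
            raise RuntimeError("interaction budget exhausted")
        self.remaining -= 1
        return x >= self._threshold


def active_agent(env: HiddenThresholdEnv) -> int:
    """Chooses its own experiments: binary search for the threshold."""
    low, high = LOW, HIGH
    while low < high and env.remaining > 0:
        mid = (low + high) // 2
        if env.probe(mid):
            high = mid       # threshold is at or below mid
        else:
            low = mid + 1    # threshold is above mid
    return low


def make_passive_agent(rng: random.Random):
    """Builds an agent that only observes probes at uniformly random points."""
    def passive_agent(env: HiddenThresholdEnv) -> int:
        low, high = LOW, HIGH
        while env.remaining > 0:
            x = rng.randint(LOW, HIGH)
            if env.probe(x):
                high = min(high, x)
            else:
                low = max(low, x + 1)
        return (low + high) // 2  # best guess: middle of the remaining interval
    return passive_agent


def mean_error(agent, trials: int = 200, budget: int = 10, seed: int = 0) -> float:
    """Average distance between the agent's guess and the hidden threshold."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        threshold = rng.randint(LOW, HIGH)
        env = HiddenThresholdEnv(threshold, budget)
        total += abs(agent(env) - threshold)
    return total / trials


active_score = mean_error(active_agent)
passive_score = mean_error(make_passive_agent(random.Random(1)))
print("active agent mean error: ", active_score)
print("passive agent mean error:", passive_score)
```

With 1,024 possible thresholds and a budget of 10 probes, the active agent always recovers the rule exactly, while the passive one is left with a residual error under the same budget. A real adaptability benchmark would of course involve far richer environments than this.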

Technical routes will shift accordingly. Simple reliance on the Scaling Law will soon hit its ceiling, because passively received data cannot teach causality. Search and approximation, scale maximization and constraint weakening: achieving AGI will necessarily require a fusion of various tools and meta-methods, not an extension of a single route.

The importance of Bennett's paper lies not in providing the ultimate answer to AGI, but in wiping clean a corner of the blurred mirror named "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversations that increasingly resemble human ones, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist.

(This article was first published on Titanium Media APP, Author | Silicon Valley Tech News, Editor | Zhao Hongyu)

Related Questions

Q: What is the core argument of the article regarding current AI models and AGI?

A: The article argues that current large AI models, despite excelling in exams, are getting further from true AGI because they rely on large-scale data approximation rather than genuine understanding. They lack the core abilities of adaptation, causal reasoning, and active experimentation needed for true intelligence.

Q: According to the paper by Michael Timothy Bennett discussed in the article, what is a proposed new definition for AGI?

A: Michael Timothy Bennett proposes defining AGI as an 'artificial scientist': a system that can adapt widely, efficiently, and scientifically to new environments and tasks under real-world constraints (like computation, memory, and energy), just like a human scientist. The focus is on the ability to discover new knowledge, not just mimic humans.

Q: What are the three key behavioral shifts that the 'artificial scientist' framework suggests an AGI must achieve, as outlined in the article?

A: The three key shifts are: 1) From 'passive learner' to 'active experimenter' (planning and executing experiments to gather information). 2) From 'knowing that' to 'knowing why' (understanding causal relationships, not just correlations). 3) Mastering the 'exploration-exploitation trade-off' (dynamically balancing the use of known knowledge and the search for new knowledge under resource constraints).

Q: What are the three 'meta-methods' for building intelligent systems that the article describes, and what is the proposed path to AGI?

A: The three meta-methods are: 1) Scale-maxing (maximizing data, parameters, compute). 2) Simp-maxing (maximizing architectural simplicity). 3) W-maxing (maximizing constraint weakening to find optimal solutions). The article argues that AGI will not be achieved through a single meta-method like Scale-maxing alone, but rather through the integration of multiple, diverse approaches.

Q: How would the acceptance of the 'artificial scientist' definition change how we evaluate AI systems, according to the article?

A: The focus would shift from evaluating based on performance on human-centric benchmarks and exams to creating new 'adaptability benchmarks.' These would test an AI's ability to discover patterns in unseen physical environments, learn new game rules faster than humans, or solve real scientific problems by autonomously forming and testing hypotheses. The core metric becomes 'how much can you discover,' not 'how much do you know.'
