Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbit | Published on 2026-05-28 | Last updated on 2026-05-28

Summary

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the leaked secret agreement between OpenAI and Microsoft, the yardstick is a financial statement: an AI system capable of generating at least $100 billion in profit qualifies as AGI. In Jensen Huang's words, the yardstick is time: AGI is inevitable within five years. Musk has repeatedly predicted that it will be achieved "next year."

Industry leaders speak different languages not because anyone is lying, but because there is no universally accepted yardstick for AGI itself. As Bennett, an independent-minded researcher in the AGI field, notes in his paper, hype and speculation have reduced AGI to a "Rorschach inkblot test": everyone sees their own imagination rather than objective facts. Santa Fe Institute scientist Melanie Mitchell likewise believes the debate can only be settled through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

2025, Who is Redrawing the Starting Line for AGI?

Facing this definition vacuum, academia began intensively "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to break the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, offered a highly provocative and notably incisive answer.

He pointed out that previous definitions, for all their variations, still revolve around the benchmark of an "educated adult." Bennett adopted scholar Pei Wang's definition of intelligence, which views intelligence as the capacity to adapt under limited resources, thereby stepping outside the "human-like" framework altogether and defining AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.
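The two readings of "9.11 versus 9.9" can be made concrete with a small sketch. It is only an illustration of the surface pattern described above, not a model of what happens inside a language model, and both function names are hypothetical.

```python
from decimal import Decimal


def compare_as_decimals(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def compare_as_dotted_parts(a: str, b: str) -> str:
    """Return whichever string is larger when read like a version or section number.

    Each dot-separated part is compared as an integer, so "9.11" beats "9.9"
    because 11 > 9. This mirrors the faulty justification quoted in the text.
    """
    parts_a = tuple(int(p) for p in a.split("."))
    parts_b = tuple(int(p) for p in b.split("."))
    return a if parts_a > parts_b else b


print(compare_as_decimals("9.11", "9.9"))      # 9.9
print(compare_as_dotted_parts("9.11", "9.9"))  # 9.11
```

Both readings occur in real text (decimals in arithmetic, dotted parts in version and chapter numbers), which is one plausible reason a pattern-matching system could pick the wrong one.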

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route, using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. Once they encounter out-of-distribution problems, they immediately falter.

More critically, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or make trade-offs between "continued exploration" and "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

Bennett's criteria deserve attention because they bring AGI down from a vague philosophical proposition to a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know that "rain" often accompanies "wet ground," but not which causes which. Only a system that understands causality can infer, when the sky is clear but the ground is wet, that a sprinkler truck passed by rather than that rain is imminent. Without causal understanding, AI can only operate within the distribution of its training data, which falls short of being "general."
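The difference between seeing a correlation and knowing the causal direction can be sketched with a toy model of the rain and sprinkler example. The probabilities and the helper names `worlds` and `prob` are assumptions made purely for illustration; nothing here comes from Bennett's paper.

```python
from itertools import product

P_RAIN = 0.3       # assumed prior probability of rain
P_SPRINKLER = 0.2  # assumed prior probability that a sprinkler truck passed


def worlds(force_wet=None):
    """Yield (rain, sprinkler, wet, probability) for every possible world.

    With force_wet=None, wet ground is caused by rain or by the sprinkler.
    With force_wet=True/False we intervene and set the ground state ourselves,
    which cuts the causal link from rain and sprinkler to wet ground.
    """
    for rain, sprinkler in product([True, False], repeat=2):
        p = (P_RAIN if rain else 1 - P_RAIN) * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
        wet = (rain or sprinkler) if force_wet is None else force_wet
        yield rain, sprinkler, wet, p


def prob(event, given, force_wet=None):
    """P(event | given), computed by exact enumeration over all worlds."""
    num = sum(p for r, s, w, p in worlds(force_wet) if given(r, s, w) and event(r, s, w))
    den = sum(p for r, s, w, p in worlds(force_wet) if given(r, s, w))
    return num / den


# Observation: seeing wet ground makes rain more likely than its 0.3 prior.
p_rain_seen_wet = prob(lambda r, s, w: r, lambda r, s, w: w)

# Intervention: hosing the ground down ourselves says nothing about rain.
p_rain_made_wet = prob(lambda r, s, w: r, lambda r, s, w: w, force_wet=True)

# Clear sky but wet ground: the sprinkler is the only remaining cause.
p_sprinkler_clear_wet = prob(lambda r, s, w: s, lambda r, s, w: w and not r)

print(round(p_rain_seen_wet, 3), round(p_rain_made_wet, 3), p_sprinkler_clear_wet)
```

A learner that only stores the observed co-occurrence of "rain" and "wet ground" has no way to tell the first two quantities apart. Separating them requires a model of which variable drives which.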

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

If it only explores without exploiting, it cannot solve immediate problems despite mastering vast knowledge; if it only exploits without exploring, it is helpless when the environment changes. AGI must dynamically balance this contradiction under resource constraints—knowing what it doesn't know and allocating computing power accordingly.
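One standard textbook way to illustrate this trade-off is an epsilon-greedy agent on a multi-armed bandit. The sketch below is a generic illustration, not a method from Bennett's paper; `run_bandit`, the arm payoff rates, and the step count are all assumed for the example.

```python
import random


def run_bandit(true_means, epsilon, steps, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return (pull counts, total reward)."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)  # running mean reward per arm
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm at random, regardless of current estimates.
            arm = rng.randrange(len(true_means))
        else:
            # Exploit: pick the best-looking arm (ties go to the lowest index).
            arm = max(range(len(true_means)), key=lambda i: estimates[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return counts, total


arms = [0.2, 0.5, 0.8]  # hypothetical payoff rates; the last arm is the best

greedy_counts, _ = run_bandit(arms, epsilon=0.0, steps=500)
mixed_counts, _ = run_bandit(arms, epsilon=0.1, steps=500)

print(greedy_counts)  # [500, 0, 0]: pure exploitation never leaves the first arm
print(mixed_counts)   # every arm can be sampled, so better ones can be found
```

With `epsilon=0.0` the agent locks onto the first arm it tried and never learns that a better one exists, which is the "helpless when the environment changes" failure described above. With `epsilon=1.0` it would sample forever without using what it learned.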

Furthermore, Bennett added a highly realistic dimension: energy constraints. By including "energy" in the definition, he draws a clear bottom line: true intelligence is not about possessing unlimited resources, but about adapting elegantly under limited ones. An AI that needs the output of an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett deconstructs the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.

Bennett's conclusion is extremely clear: although Scale-maxing currently dominates absolutely, AGI will never be achieved through the brute-force aesthetics of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will completely change. We no longer need to see how many more points a large model surpasses on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core is no longer "how much you know," but "how much you can discover."
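What such an "adaptability benchmark" might measure can be sketched with a toy harness: an environment hides a rule, each agent gets a fixed interaction budget, and the score is whether the rule was identified within that budget. The article does not specify any benchmark design, so every name here (`HiddenThresholdEnv`, `active_agent`, `passive_agent`, `score`) is hypothetical.

```python
class HiddenThresholdEnv:
    """Toy environment: answers True when a probe is at or above a hidden threshold."""

    def __init__(self, threshold, budget):
        self._threshold = threshold
        self.budget = budget  # maximum number of interactions allowed
        self.used = 0

    def probe(self, x):
        if self.used >= self.budget:
            raise RuntimeError("interaction budget exhausted")
        self.used += 1
        return x >= self._threshold


def active_agent(env, low, high):
    """Chooses each probe to halve the remaining hypotheses (bisection)."""
    while low < high and env.used < env.budget:
        mid = (low + high) // 2
        if env.probe(mid):
            high = mid
        else:
            low = mid + 1
    return low if low == high else None  # None: rule not identified in budget


def passive_agent(env, low, high):
    """Takes probes in a fixed order it does not control, like a pre-collected dataset."""
    for x in range(low, high + 1):
        if env.used >= env.budget:
            return None
        if env.probe(x):
            return x
    return None


def score(agent, threshold, budget, low=0, high=1023):
    """Return (rule identified?, interactions used) for one agent on one hidden rule."""
    env = HiddenThresholdEnv(threshold, budget)
    guess = agent(env, low, high)
    return guess == threshold, env.used


print(score(active_agent, threshold=700, budget=10))   # (True, 10)
print(score(passive_agent, threshold=700, budget=10))  # (False, 10)
```

Both agents spend the same ten interactions. Only the one that designs its own experiments discovers the rule, which is the kind of difference a score based on "how much you can discover" would reward.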

Technical routes will also shift accordingly. The simple Scaling Law will soon hit its ceiling because passively received data cannot feed causality. Search and approximation, scale maximization and constraint weakening—the achievement of AGI will necessarily be a fusion of various tools and meta-methods, not an extension of a single route.

The importance of Bennett's paper lies not in providing the ultimate answer to AGI, but in wiping clean a corner of the blurred mirror called "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversations that increasingly resemble humans, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist.

(This article was first published on Titanium Media APP, Author | Silicon Valley Tech News, Editor | Zhao Hongyu)

Related Questions

Q: What is the core argument of the article regarding current AI models and AGI?

A: The article argues that current large AI models, despite excelling in exams, are getting further from true AGI because they rely on large-scale data approximation rather than genuine understanding. They lack the core abilities of adaptation, causal reasoning, and active experimentation needed for true intelligence.

Q: According to the paper by Michael Timothy Bennett discussed in the article, what is a proposed new definition for AGI?

A: Michael Timothy Bennett proposes defining AGI as an "artificial scientist": a system that can adapt widely, efficiently, and scientifically to new environments and tasks under real-world constraints (like computation, memory, and energy), just like a human scientist. The focus is on the ability to discover new knowledge, not just mimic humans.

Q: What are the three key behavioral shifts that the "artificial scientist" framework suggests an AGI must achieve, as outlined in the article?

A: The three key shifts are: 1) From "passive learner" to "active experimenter" (planning and executing experiments to gather information). 2) From "knowing that" to "knowing why" (understanding causal relationships, not just correlations). 3) Mastering the "exploration-exploitation trade-off" (dynamically balancing the use of known knowledge and the search for new knowledge under resource constraints).

Q: What are the three "meta-methods" for building intelligent systems that the article describes, and what is the proposed path to AGI?

A: The three meta-methods are: 1) Scale-maxing (maximizing data, parameters, compute). 2) Simp-maxing (maximizing architectural simplicity). 3) W-maxing (maximizing constraint weakening to find optimal solutions). The article argues that AGI will not be achieved through a single meta-method like Scale-maxing alone, but rather through the integration of multiple, diverse approaches.

Q: How would the acceptance of the "artificial scientist" definition change how we evaluate AI systems, according to the article?

A: The focus would shift from evaluating based on performance on human-centric benchmarks and exams to creating new "adaptability benchmarks." These would test an AI's ability to discover patterns in unseen physical environments, learn new game rules faster than humans, or solve real scientific problems by autonomously forming and testing hypotheses. The core metric becomes "how much can you discover," not "how much do you know."
