Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbit · Published 2026-05-28 · Last updated 2026-05-28

Summary

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the confidential agreement between OpenAI and Microsoft that was later exposed, the yardstick is the financial statement: an AI system capable of generating at least $100 billion in profit qualifies as AGI. For Jensen Huang, the yardstick is time: AGI is inevitable within five years. Musk has repeatedly predicted "achieving it next year."

Industry leaders talk past one another not because anyone is lying, but because AGI itself has no universally accepted yardstick. As Bennett, an independent-minded researcher in the AGI field, notes in his paper, hype and speculation have reduced AGI to a "Rorschach inkblot test": everyone sees only their own imagination, not objective facts. Santa Fe Institute scientist Melanie Mitchell likewise believes this debate can only be clarified through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

2025: Who Is Redrawing the Starting Line for AGI?

Facing this definition vacuum, academia began intensively "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to break the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, offered a highly provocative and especially incisive answer.

He pointed out that previous definitions, for all their circling, still measure against the benchmark of an "educated adult." Bennett instead adopted scholar Pei Wang's definition of intelligence, which treats intelligence as the capacity to adapt under limited resources. This steps outside the "human-like" framework altogether and defines AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route, using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. When they encounter out-of-distribution problems, they immediately falter.

More critically, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or make trade-offs between "continued exploration" and "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.
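The "11 is greater than 9" mistake can be reproduced with a toy comparison. The sketch below is an illustration only, not a description of how any language model actually computes its answer. The function names `compare_as_numbers` and `compare_fragments` are hypothetical, and the second one deliberately mimics the surface habit of comparing the parts around the decimal point as whole numbers, the way version or section numbers are ordered.

```python
from decimal import Decimal


def compare_as_numbers(a: str, b: str) -> str:
    """Return the larger of two decimal strings by numeric value."""
    return a if Decimal(a) > Decimal(b) else b


def compare_fragments(a: str, b: str) -> str:
    """Hypothetical surface heuristic: split on '.' and compare the parts
    as integers, as one would for version numbers such as 9.11 vs 9.9."""
    parts_a = [int(p) for p in a.split(".")]
    parts_b = [int(p) for p in b.split(".")]
    return a if parts_a > parts_b else b


print(compare_as_numbers("9.11", "9.9"))  # 9.9
print(compare_fragments("9.11", "9.9"))   # 9.11
```

Both functions see the same characters. Only the first carries a model of what a decimal number is, which is the gap the paragraph above describes.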

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

Bennett's criteria deserve attention because they turn AGI from a vague philosophical proposition into a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know "rain" often accompanies "wet ground," but don't know which causes which. Only by understanding causality can a system infer, when the sky is clear but the ground is wet, that a sprinkler truck has passed by, not that rain is imminent. Without causal understanding, AI can only operate within the distribution of its training data, which has nothing to do with being "general."
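The rain and sprinkler example can be made concrete with a tiny hand-built model of how wet ground comes about. This is a minimal sketch using inference by enumeration, not a method taken from Bennett's paper. Every probability below is invented for illustration, and the names `joint` and `posterior` are hypothetical.

```python
from itertools import product

# All probabilities below are invented for illustration.
P_RAIN = 0.3
P_SPRINKLER = 0.1
P_CLEAR_GIVEN_RAIN = {True: 0.05, False: 0.8}  # P(sky is clear | rain)


def joint(rain: bool, sprinkler: bool, clear: bool) -> float:
    """Probability of one world; rain and sprinkler are independent causes."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_clear = P_CLEAR_GIVEN_RAIN[rain]
    p *= p_clear if clear else 1 - p_clear
    return p


def posterior(query: str, evidence: dict) -> float:
    """P(query | evidence), computed by enumerating all eight worlds."""
    num = den = 0.0
    for rain, sprinkler, clear in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "clear": clear,
                 "wet": rain or sprinkler}  # ground is wet if either cause fired
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(rain, sprinkler, clear)
            den += p
            if world[query]:
                num += p
    return num / den


p_rain_wet = posterior("rain", {"wet": True})
p_rain_wet_clear = posterior("rain", {"wet": True, "clear": True})
p_sprinkler_wet_clear = posterior("sprinkler", {"wet": True, "clear": True})
print(round(p_rain_wet, 2), round(p_rain_wet_clear, 2),
      round(p_sprinkler_wet_clear, 2))
```

Under these made-up numbers, wet ground alone points strongly to rain. Once the clear sky is taken into account, the sprinkler becomes the far more likely explanation. A learner that only stores the rain and wet-ground correlation has no way to make that switch.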

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

A system that only explores without exploiting cannot solve immediate problems despite mastering vast knowledge; one that only exploits without exploring is helpless when the environment changes. AGI must dynamically balance this contradiction under resource constraints, knowing what it doesn't know and allocating computing power accordingly.
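One textbook way to see the exploration and exploitation trade-off is the epsilon-greedy strategy on a multi-armed bandit. The sketch below is that standard technique, swapped in for illustration; it is not the mechanism proposed in Bennett's paper. The function name `epsilon_greedy`, the reward means, and the parameter values are all assumptions.

```python
import random


def epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a Gaussian multi-armed bandit.

    Returns the estimated mean reward per arm and the pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            # exploit: pick the arm with the best estimate so far
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental average, so no reward history needs to be stored.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = epsilon_greedy([0.2, 0.5, 0.9])
print(counts)
```

Setting `epsilon` to 0 gives an agent that only exploits and can lock onto whichever arm looked good early. Setting it to 1 gives an agent that only explores and never uses what it has learned. The `steps` budget plays the role of the resource constraint: every pull spent exploring is a pull not spent on the best known option.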

Furthermore, Bennett added a highly realistic dimension: energy constraints. By including "energy" in the definition, he draws a clear bottom line: true intelligence is not about possessing unlimited resources, but about adapting elegantly under limited resources. An AI that needs the output of an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett deconstructs the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.
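The machine-dependence of "simplest" noted under Simp-maxing can be stated with Kolmogorov complexity. What follows is the standard invariance theorem from algorithmic information theory, offered as background rather than as a formula taken from Bennett's paper. The symbols $U$, $V$, $K_U$, and $c_{U,V}$ are notation introduced for this note.

```latex
% K_U(x): length of the shortest program that makes the universal
% machine U output the string x.
% For any two universal machines U and V there is a constant c_{U,V},
% independent of x, such that
\[
  \lvert K_U(x) - K_V(x) \rvert \;\le\; c_{U,V}
  \quad \text{for all strings } x .
\]
```

The bound holds only up to the constant $c_{U,V}$, which can be large relative to the descriptions being compared. Two finite hypotheses can therefore swap their simplicity ranking when the reference machine changes, which is the subjectivity the Simp-maxing paragraph refers to.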

Bennett's conclusion is extremely clear: although Scale-maxing currently dominates absolutely, AGI will never be achieved through the brute-force aesthetics of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will completely change. We no longer need to see how many more points a large model scores on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core is no longer "how much you know," but "how much you can discover."

Technical routes will also shift accordingly. The simple Scaling Law will soon hit its ceiling because passively received data cannot supply causal understanding. Whether search and approximation or scale maximization and constraint weakening, AGI will necessarily be achieved through a fusion of various tools and meta-methods, not an extension of a single route.

The importance of Bennett's paper lies not in providing the ultimate answer to AGI, but in wiping clean a corner of the blurred mirror named "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversations that increasingly resemble humans, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist.

(This article was first published on Titanium Media APP. Author: Silicon Valley Tech News. Editor: Zhao Hongyu.)

Related Questions

Q: What is the core argument of the article regarding current AI models and AGI?

A: The article argues that current large AI models, despite excelling in exams, are getting further from true AGI because they rely on large-scale data approximation rather than genuine understanding. They lack the core abilities of adaptation, causal reasoning, and active experimentation needed for true intelligence.

Q: According to the paper by Michael Timothy Bennett discussed in the article, what is a proposed new definition for AGI?

A: Michael Timothy Bennett proposes defining AGI as an 'artificial scientist': a system that can adapt widely, efficiently, and scientifically to new environments and tasks under real-world constraints (like computation, memory, and energy), just like a human scientist. The focus is on the ability to discover new knowledge, not just mimic humans.

Q: What are the three key behavioral shifts that the 'artificial scientist' framework suggests an AGI must achieve, as outlined in the article?

A: The three key shifts are: 1) From 'passive learner' to 'active experimenter' (planning and executing experiments to gather information). 2) From 'knowing what' to 'knowing why' (understanding causal relationships, not just correlations). 3) Mastering the 'exploration-exploitation trade-off' (dynamically balancing the use of known knowledge and the search for new knowledge under resource constraints).

Q: What are the three 'meta-methods' for building intelligent systems that the article describes, and what is the proposed path to AGI?

A: The three meta-methods are: 1) Scale-maxing (maximizing data, parameters, compute). 2) Simp-maxing (maximizing architectural simplicity). 3) W-maxing (maximizing constraint weakening to find optimal solutions). The article argues that AGI will not be achieved through a single meta-method like Scale-maxing alone, but rather through the integration of multiple, diverse approaches.

Q: How would the acceptance of the 'artificial scientist' definition change how we evaluate AI systems, according to the article?

A: The focus would shift from evaluating based on performance on human-centric benchmarks and exams to creating new 'adaptability benchmarks.' These would test an AI's ability to discover patterns in unseen physical environments, learn new game rules faster than humans, or solve real scientific problems by autonomously forming and testing hypotheses. The core metric becomes 'how much can you discover,' not 'how much do you know.'
