Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbit, published 2026-05-28, updated 2026-05-28

Summary

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the leaked agreement between OpenAI and Microsoft, the yardstick is the financial statement: an AI system capable of generating at least $100 billion in profit qualifies as AGI. For Jensen Huang, the yardstick is time: AGI is inevitable within five years. Musk has repeatedly predicted that it will be achieved "next year."

Industry leaders speak different languages not because anyone is lying, but because there is no universally accepted yardstick for AGI itself. As Bennett, an independent-minded researcher in the AGI field, notes in his paper, hype and speculation have reduced AGI to a "Rorschach inkblot test": everyone sees only their own imagination, not objective facts. Santa Fe Institute scientist Melanie Mitchell likewise believes this debate can only be settled through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

2025, Who is Redrawing the Starting Line for AGI?

Facing this definition vacuum, academia began intensively "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to break the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, offered a provocative and incisive answer.

He pointed out that previous definitions, for all their circling, still cling to the benchmark of an "educated adult." Bennett adopted scholar Pei Wang's definition of intelligence, which treats intelligence as the capacity to adapt under limited resources. This steps outside the "human-like" framework altogether and defines AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route, using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. Once they encounter out-of-distribution problems, they immediately falter.

More critically, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or make trade-offs between "continued exploration" and "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.
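The mix-up can be reproduced with a toy comparison. The sketch below is purely illustrative and does not describe how a language model computes anything. It contrasts reading "9.11" and "9.9" as decimal values with reading them as dot-separated integer fields, the way version numbers and section headings are ordered in a great deal of text. The function names are hypothetical.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return whichever string denotes the larger decimal number."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_dotted_fields(a: str, b: str) -> str:
    """Return the 'larger' string when each dot-separated field is an integer.

    This is the ordering used for version numbers such as 9.9 and 9.11.
    """
    fields_a = [int(part) for part in a.split(".")]
    fields_b = [int(part) for part in b.split(".")]
    return a if fields_a > fields_b else b


print(larger_as_decimal("9.11", "9.9"))        # 9.9
print(larger_as_dotted_fields("9.11", "9.9"))  # 9.11
```

Both orderings are valid in their own context. The error lies in applying the pattern that looks most familiar without a model of what the symbols mean.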

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

Bennett's criteria deserve attention because they bring AGI down from a vague philosophical proposition to a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.
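The gap between being fed data and choosing one's own experiments can be made concrete with a toy environment. The sketch below is an assumption-laden illustration, not something taken from Bennett's paper: the "room" hides an integer threshold between 0 and 1023, and a probe at position x succeeds only if x is at least that threshold. A passive learner receives ten probes picked in advance by someone else, while an active learner picks each of its ten probes based on what it has seen so far. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def make_room(threshold: int) -> Callable[[int], bool]:
    """Hidden rule of the toy environment: a probe at x succeeds iff x >= threshold."""
    def probe(x: int) -> bool:
        return x >= threshold
    return probe


def passive_learner(probe: Callable[[int], bool], samples: List[int],
                    low: int, high: int) -> Tuple[int, int]:
    """Narrow the range of possible thresholds using only the samples handed over."""
    for x in samples:
        if probe(x):
            high = min(high, x)      # threshold is at most x
        else:
            low = max(low, x + 1)    # threshold is above x
    return low, high


def active_learner(probe: Callable[[int], bool], budget: int,
                   low: int, high: int) -> Tuple[int, int]:
    """Choose every probe itself, always splitting the remaining range in half."""
    for _ in range(budget):
        if low == high:
            break
        mid = (low + high) // 2
        if probe(mid):
            high = mid
        else:
            low = mid + 1
    return low, high


room = make_room(700)
fixed_samples = [93 * k for k in range(1, 11)]  # 93, 186, ..., 930

print(passive_learner(room, fixed_samples, 0, 1023))  # (652, 744)
print(active_learner(room, 10, 0, 1023))              # (700, 700)
```

With the same budget of ten interactions, the passive learner is left with 93 candidate thresholds, while the active learner pins the threshold down exactly.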

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know "rain" often accompanies "wet ground," but don't know which causes which. Only by understanding causality can a system infer, when the sky is clear but the ground is wet, that a sprinkler truck passed by, rather than concluding that rain is imminent. Without causal understanding, AI can only operate within the distribution of its training data, which has nothing to do with being "general."
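The rain and sprinkler example can be worked through with a tiny causal model. The sketch below is a minimal illustration under stated assumptions: the probabilities (rain 20%, sprinkler truck 10%, independent of each other) are invented for the example, and the rule "the ground is wet if it rained or the truck passed" stands in for real causal knowledge. It compares what observing wet ground implies about rain with what happens when the ground is made wet by intervention.

```python
from itertools import product

P_RAIN = 0.2       # assumed prior, for illustration only
P_SPRINKLER = 0.1  # assumed prior, for illustration only


def worlds(force_wet=None):
    """Enumerate (rain, sprinkler, wet, probability) for every possible world.

    With force_wet=None the ground follows the causal rule wet = rain or sprinkler.
    With force_wet=True or False the ground state is set by intervention,
    which cuts it off from its usual causes.
    """
    result = []
    for rain, sprinkler in product([True, False], repeat=2):
        p = (P_RAIN if rain else 1 - P_RAIN) * \
            (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
        wet = (rain or sprinkler) if force_wet is None else force_wet
        result.append((rain, sprinkler, wet, p))
    return result


def prob(event, given, ws):
    """Conditional probability P(event | given) by exact enumeration."""
    denom = sum(p for r, s, w, p in ws if given(r, s, w))
    numer = sum(p for r, s, w, p in ws if given(r, s, w) and event(r, s, w))
    return numer / denom


# Observation: wet ground is strong evidence of rain.
p_rain_seen_wet = prob(lambda r, s, w: r, lambda r, s, w: w, worlds())

# Intervention: hosing the ground down says nothing about rain.
p_rain_made_wet = prob(lambda r, s, w: r, lambda r, s, w: w, worlds(force_wet=True))

# Wet ground with no rain: the sprinkler truck is the only remaining cause.
p_truck = prob(lambda r, s, w: s, lambda r, s, w: w and not r, worlds())

print(round(p_rain_seen_wet, 3), round(p_rain_made_wet, 3), round(p_truck, 3))
```

A pure correlation learner only has access to the first quantity. Telling the first two apart requires knowing which way the arrows point.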

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

A system that only explores without exploiting cannot solve immediate problems despite mastering vast knowledge; one that only exploits without exploring is helpless when the environment changes. AGI must dynamically balance this tension under resource constraints: knowing what it doesn't know and allocating computing power accordingly.
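A standard textbook setting for this trade-off is the multi-armed bandit. The sketch below uses the UCB1 rule, which is a swapped-in classical technique and is not taken from Bennett's paper. Each option's score is its estimated payoff (exploitation) plus a bonus that grows with uncertainty about that option (exploration). The arm probabilities, step count, and seed are arbitrary assumptions.

```python
import math
import random


def ucb1(arm_means, steps, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms

    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise its estimate
        else:
            # Mean reward so far plus an uncertainty bonus that shrinks
            # as an arm accumulates pulls.
            scores = [
                totals[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
                for i in range(n_arms)
            ]
            arm = max(range(n_arms), key=scores.__getitem__)

        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward

    return pulls


pulls = ucb1([0.2, 0.5, 0.8], steps=2000)
print(pulls)  # the last arm should receive most of the pulls
```

The weaker arms are still sampled occasionally, which is what would let the agent notice if the environment changed; the interaction budget plays the role of the limited resource.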

Furthermore, Bennett added a highly realistic dimension: energy constraints. Including "energy" in the definition means he draws a clear bottom line: true intelligence is not about possessing unlimited resources, but about elegantly adapting under limited resources. An AI that requires consuming an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett divides the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.

Bennett's conclusion is extremely clear: although Scale-maxing currently dominates absolutely, AGI will never be achieved through the brute-force aesthetics of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will completely change. We no longer need to track how many more points a large model scores on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core is no longer "how much you know," but "how much you can discover."

Technical routes will also shift accordingly. The simple Scaling Law will soon hit its ceiling because passively received data cannot teach causality. Achieving AGI will necessarily require a fusion of various tools and meta-methods (search and approximation, scale maximization and constraint weakening), not an extension of a single route.

The importance of Bennett's paper lies not in providing the ultimate answer to AGI, but in wiping clean a corner of the clouded mirror called "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversations that sound increasingly human, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist. (This article was first published on Titanium Media APP, Author | Silicon Valley Tech News, Editor | Zhao Hongyu)

Related Questions

Q: What is the core argument of the article regarding current AI models and AGI?

A: The article argues that current large AI models, despite excelling in exams, are getting further from true AGI because they rely on large-scale data approximation rather than genuine understanding. They lack the core abilities of adaptation, causal reasoning, and active experimentation needed for true intelligence.

Q: According to the paper by Michael Timothy Bennett discussed in the article, what is a proposed new definition for AGI?

A: Michael Timothy Bennett proposes defining AGI as an 'artificial scientist': a system that can adapt widely, efficiently, and scientifically to new environments and tasks under real-world constraints (like computation, memory, and energy), just like a human scientist. The focus is on the ability to discover new knowledge, not just mimic humans.

Q: What are the three key behavioral shifts that the 'artificial scientist' framework suggests an AGI must achieve, as outlined in the article?

A: The three key shifts are: 1) From 'passive learner' to 'active experimenter' (planning and executing experiments to gather information). 2) From 'knowing that' to 'knowing why' (understanding causal relationships, not just correlations). 3) Mastering the 'exploration-exploitation trade-off' (dynamically balancing the use of known knowledge and the search for new knowledge under resource constraints).

Q: What are the three 'meta-methods' for building intelligent systems that the article describes, and what is the proposed path to AGI?

A: The three meta-methods are: 1) Scale-maxing (maximizing data, parameters, compute). 2) Simp-maxing (maximizing architectural simplicity). 3) W-maxing (maximizing constraint weakening to find optimal solutions). The article argues that AGI will not be achieved through a single meta-method like Scale-maxing alone, but rather through the integration of multiple, diverse approaches.

Q: How would the acceptance of the 'artificial scientist' definition change how we evaluate AI systems, according to the article?

A: The focus would shift from evaluating based on performance on human-centric benchmarks and exams to creating new 'adaptability benchmarks.' These would test an AI's ability to discover patterns in unseen physical environments, learn new game rules faster than humans, or solve real scientific problems by autonomously forming and testing hypotheses. The core metric becomes 'how much can you discover,' not 'how much do you know.'
