Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbitPublicado em 2026-05-28Última atualização em 2026-05-28

Resumo

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the secret agreement exposed between OpenAI and Microsoft, the yardstick is financial statements—an AI system capable of generating at least $100 billion in profit qualifies as AGI. In Jensen Huang's words, the yardstick is time—inevitable within five years; Musk has repeatedly predicted "achieving it next year."

The fact that industry leaders speak different languages doesn't stem from anyone lying, but from the absence of a universally accepted yardstick for AGI itself. As noted by a researcher with independent thinking in the AGI field, Bennett, in his paper, AGI has been reduced by hype and speculation to a "Rorschach inkblot test"—everyone sees only their own imagination, not objective facts; Santa Fe Institute scientist Melanie Mitchell also believes this debate can only be clarified through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

2025, Who is Redrawing the Starting Line for AGI?

Facing this definition vacuum, academia began intensively "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to break the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, provided an extremely provocative yet most incisive answer.

He pointed out that previous definitions, circling round and round, still grapple with the benchmark of an "educated adult." Bennett adopted scholar Pei Wang's definition of intelligence—viewing intelligence as adaptive capability under limited resources—fundamentally leaping out of the "human-like" framework and defining AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route—using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. Once encountering out-of-distribution problems, they immediately falter.

More fatally, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or make trade-offs between "continued exploration" and "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

The reason Bennett's set of criteria deserves attention is that he downgraded AGI from a vague philosophical proposition to a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know "rain" often accompanies "wet ground," but don't know which causes which. Only by understanding causality can they infer, when the sky is clear but the ground is wet, that a sprinkler truck passed by rather than rain is imminent. Without causal understanding, AI can only operate within the distribution of its training data, which has nothing to do with being "general."

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

If it only explores without exploiting, it cannot solve immediate problems despite mastering vast knowledge; if it only exploits without exploring, it is helpless when the environment changes. AGI must dynamically balance this contradiction under resource constraints—knowing what it doesn't know and allocating computing power accordingly.

Furthermore, Bennett added a highly realistic dimension: energy constraints. Including "energy" in the definition means he draws a clear bottom line: true intelligence is not about possessing unlimited resources, but about elegantly adapting under limited resources. An AI that requires consuming an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett deconstructs the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.

Bennett's conclusion is extremely clear: although Scale-maxing currently dominates absolutely, AGI will never be achieved through the brute-force aesthetics of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will completely change. We no longer need to see how many more points a large model surpasses on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core is no longer "how much you know," but "how much you can discover."

Technical routes will also shift accordingly. The simple Scaling Law will soon hit its ceiling because passively received data cannot feed causality. Search and approximation, scale maximization and constraint weakening—the achievement of AGI will necessarily be a fusion of various tools and meta-methods, not an extension of a single route.

The importance of Bennett's paper lies not in him providing the ultimate answer to AGI, but in him cleaning a corner of the blurred mirror named "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversations that increasingly resemble humans, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist.(This article was first published on Titanium Media APP, Author | Silicon Valley Tech News, Editor | Zhao Hongyu)

Perguntas relacionadas

QWhat is the core argument of the article regarding current AI models and AGI?

AThe article argues that current large AI models, despite excelling in exams, are getting further from true AGI because they rely on large-scale data approximation rather than genuine understanding. They lack the core abilities of adaptation, causal reasoning, and active experimentation needed for true intelligence.

QAccording to the paper by Michael Timothy Bennett discussed in the article, what is a proposed new definition for AGI?

AMichael Timothy Bennett proposes defining AGI as an 'artificial scientist'—a system that can adapt widely, efficiently, and scientifically to new environments and tasks under real-world constraints (like computation, memory, and energy), just like a human scientist. The focus is on the ability to discover new knowledge, not just mimic humans.

QWhat are the three key behavioral shifts that the 'artificial scientist' framework suggests an AGI must achieve, as outlined in the article?

AThe three key shifts are: 1) From 'passive learner' to 'active experimenter' (planning and executing experiments to gather information). 2) From 'knowing what' to 'knowing why' (understanding causal relationships, not just correlations). 3) Mastering the 'exploration-exploitation trade-off' (dynamically balancing the use of known knowledge and the search for new knowledge under resource constraints).

QWhat are the three 'meta-methods' for building intelligent systems that the article describes, and what is the proposed path to AGI?

AThe three meta-methods are: 1) Scale-maxing (maximizing data, parameters, compute). 2) Simp-maxing (maximizing architectural simplicity). 3) W-maxing (maximizing constraint weakening to find optimal solutions). The article argues that AGI will not be achieved through a single meta-method like Scale-maxing alone, but rather through the integration of multiple, diverse approaches.

QHow would the acceptance of the 'artificial scientist' definition change how we evaluate AI systems, according to the article?

AThe focus would shift from evaluating based on performance on human-centric benchmarks and exams to creating new 'adaptability benchmarks.' These would test an AI's ability to discover patterns in unseen physical environments, learn new game rules faster than humans, or solve real scientific problems by autonomously forming and testing hypotheses. The core metric becomes 'how much can you discover,' not 'how much do you know.'

Leituras Relacionadas

From Tokens to Machine Labor: AI is Shifting from Tool to "Worker"

The article "From Token to Machine Labor: AI is Evolving from Tool to 'Worker'" argues that the business model for AI is shifting beyond simply selling computational resources (tokens, GPU hours) or model access. Instead, a new "machine labor market" is emerging, where the core economic transaction is the purchase of economically useful work directly performed by software. The central thesis is that AI pricing will evolve through four stages: 1) raw tokens, 2) standardized LLM capabilities (e.g., text generation), 3) industry-specific labor markets (e.g., legal review, radiology), and finally 4) a programmable results market where tasks like resolving a support ticket are bid on and priced based on outcome. In this future, buyers will care less about *which* model or GPU completes a task and more about whether the work meets specified standards for accuracy, latency, and cost. This transition reframes the impact of AI on human labor. Rather than simple replacement, it suggests a re-coordination where machines handle standardized, verifiable work, freeing humans for roles involving oversight, context management, responsibility, and final judgment. In some cases, this "last 1%" of human input becomes more valuable as it enables the other 99% to be automated. Furthermore, as AI reduces the cost of work, demand may expand, creating larger markets (e.g., 24/7 customer service) rather than just cheaper versions of existing ones. The article concludes that while infrastructure (GPUs, models, tokens) remains crucial upstream, the market is converging on a simpler, tradeable unit: machine labor that can be defined, measured, priced, and procured based on contractible specifications.

marsbitHá 9m

From Tokens to Machine Labor: AI is Shifting from Tool to "Worker"

marsbitHá 9m

Xiaomi MiMo's 99% Price Cut is Not Marketing! Luo Fuli Posts on X to Refute Critics

The price of Xiaomi's MiMo-V2.5 series API has been permanently reduced by up to 99%, specifically for the "Input (Cache Hit)" cost, which covers users re-reading historical context in long conversations. MiMo's head, Luo Fuli, published a detailed technical blog to clarify that this drastic price cut stems from genuine engineering breakthroughs, not a marketing stunt or a simple price war. The core of the achievement lies in six key engineering optimizations. First, the model architecture adopts a Hybrid Sliding Window Attention (SWA), reducing the memory footprint (KVCache) to 1/7th of a traditional model. Second, a dual-pool memory management system actually utilizes these savings, allowing a single GPU to handle over 5 times more concurrent users. Third, an upgraded prefix caching mechanism achieves a cache hit rate of 93-95% for repeated reads, meaning most such requests bypass GPU computation entirely. Fourth, a self-developed distributed cache (GCache) utilizes idle SSD space on existing GPU servers, eliminating additional storage costs. Fifth, an intelligent scheduling system (LLM-Router) efficiently routes requests to maximize cache reuse and performance. Sixth, Multi-Token Prediction (MTP) accelerates the model's text generation ("output") side. Together, these systemic optimizations dramatically lower the real computational cost per request, enabling the 99% price reduction for cached inputs while reportedly maintaining positive gross margins. Luo Fuli's disclosure aims to shift the narrative from "price war" to a demonstration of substantive AI engineering progress.

marsbitHá 2h

Xiaomi MiMo's 99% Price Cut is Not Marketing! Luo Fuli Posts on X to Refute Critics

marsbitHá 2h

$26 Billion: An 'All-Chinese Team' Backs the World's Highest-Valued AI Programming Company

Cognition AI, the company behind the AI programmer "Devin," has raised over $1 billion in new funding at a valuation of $26 billion, just eight months after reaching a $10.2 billion valuation. The round was led by Lux Capital, General Catalyst, and 8VC. Founded by three young Chinese entrepreneurs with strong competitive programming backgrounds, Cognition initially gained fame with Devin, marketed as the world's first AI software engineer capable of handling tasks from start to finish. While its early demos were impressive, real-world usage revealed reliability and cost-effectiveness issues, leading to a significant price cut for Devin in 2025. A pivotal moment came when Cognition acquired the assets of AI IDE company Windsurf after a failed acquisition by OpenAI. This move gave Cognition a crucial developer-facing tool, allowing it to pursue a two-pronged strategy: Devin for autonomous task execution and Windsurf for integrated, collaborative coding within an IDE. This shift helped the company move away from the controversial "AI replacement" narrative towards a model of augmenting human engineers, particularly for repetitive or maintenance tasks. This strategic pivot is backed by strong commercial metrics. The company reports a 10x increase in enterprise usage this year, with an annual revenue run-rate of $492 million and a 50% month-over-month growth in enterprise Devin usage over the past six months. Its client list now includes major corporations like Goldman Sachs and Mercedes-Benz, as well as government agencies like NASA and the U.S. Army. Investors are betting on Cognition becoming a foundational piece of next-generation software engineering infrastructure, positioning it at the center of a hybrid future where AI agents and human developers work in tandem.

marsbitHá 2h

$26 Billion: An 'All-Chinese Team' Backs the World's Highest-Valued AI Programming Company

marsbitHá 2h

The Hottest 00s Generation on Wall Street

"Wall Street's Hottest '00s Phenom: The 25-Year-Old Fund Manager Who Bet on AI's 'Boring' Backbone" At just 25, Leopold Aschenbrenner, once fired by OpenAI, now runs a hedge fund worth $13.7 billion. His strategy? Betting against the consensus. While others chased AI chips, he invested early in the physical infrastructure powering the AI boom: electricity, data centers, and energy. Expelled from OpenAI's safety team in 2024, Aschenbrenner foresaw the coming bottleneck. He argued that AI progress would be limited not by algorithms, but by power, chip capacity, and space. Acting on this, he founded Situational Awareness LP to go long on these "old economy" assets. His bets have paid off spectacularly. His fund's assets soared from $255 million in late 2024 to $13.7 billion by Q1 2026. His portfolio is a direct reflection of his thesis: major long positions in fuel cell company Bloom Energy and data center/bitcoin mining firms like CleanSpark and Riot Platforms, which control critical land and power resources. Conversely, he holds massive put options against overheated semiconductor giants like NVIDIA and AMD. A notable exception was his bullish bet on storage company SanDisk, which surged ~160% in Q2. Aschenbrenner's vision is materializing. Tech giants like Amazon, Alphabet, and Meta are ramping up colossal capital expenditure on data centers. Global data center power consumption is projected to skyrocket, with AI accounting for over half by 2030. The demand for enabling technologies like optical fiber and modules is also exploding. His story underscores a fundamental truth of the AI era: the ethereal intelligence of algorithms rests on a very physical, heavy, and power-hungry foundation. The future is being built not just in code, but in concrete, copper, and kilowatts.

marsbitHá 4h

The Hottest 00s Generation on Wall Street

marsbitHá 4h

Review of Cathie Wood's Masterstroke Operation on Circle

A Recap of Cathie Wood's Masterful Trading in Circle's IPO This article analyzes the strategic moves made by ARK Invest's Cathie Wood around the IPO of Circle (CRCL). Despite her typical long-term, narrative-driven investment style, Wood executed a textbook "buy low, sell high" trade. Wood secured a core position of approximately 4.49 million shares at the $31 IPO price. The stock debuted at $69, surged to a high of $299 in June 2025 fueled by stablecoin regulatory news (the GENIUS Act), and then entered a prolonged decline. During this rally, ARK systematically sold around 1.7 million shares at an average price near $210, driven partly by internal fund rebalancing rules triggered by the stock's soaring weight. This move locked in substantial profits. As the stock later fell due to lockup expirations, new share issuance, and interest rate concerns—even dipping below $50—Wood began repurchasing shares. Starting in November 2025 around $86, she continued buying on the way down, eventually rebuilding her position to roughly the original size by Q1 2026. Key takeaways include: 1) Having a strong, independent long-term thesis (viewing Circle as critical digital dollar infrastructure). 2) Trading in tranches instead of trying to time exact tops or bottoms. 3) Maintaining strict position-sizing discipline, using rules to force profit-taking and preserve buying power. For most retail investors, chasing the dramatic "pop" at open is dangerous, as the subsequent 83% drawdown showed. Wood's success hinged on pre-IPO access, a clear investment thesis, and disciplined execution.

marsbitHá 6h

Review of Cathie Wood's Masterstroke Operation on Circle

marsbitHá 6h

Trading

Spot
Futuros
活动图片