Large Language Models Ace All Exams, Yet Move Farther from AGI: What Does This Paper Reveal?

marsbit | Published on 2026-05-28 | Last updated on 2026-05-28

Abstract

The article discusses the ongoing challenge of defining and achieving Artificial General Intelligence (AGI). It notes that industry leaders have set vague, often profit- or time-based benchmarks for AGI, while the concept itself lacks a consensus definition—a situation the article compares to a "Rorschach test." It highlights a recent 2025 paper by researcher Michael Timothy Bennett, who proposes a new, measurable definition. Bennett frames AGI not as mimicking human performance on tests, which current large language models (LLMs) have already mastered, but as an "artificial scientist." A true AGI, according to this view, should be able to widely and efficiently adapt to new environments and tasks within real-world constraints (like computational and energy limits), focusing on the *discovery of new knowledge* rather than the replication of existing data. The author contrasts this with the current dominant approach of "scale-maxing"—massively scaling up data, parameters, and compute. While powerful, this method leads to models that fail on out-of-distribution problems and lack core intelligent abilities: they are passive learners, cannot reason causally, and cannot actively experiment or balance exploration with exploitation. The article argues that Bennett's framework offers a crucial shift. It makes AGI a quantifiable engineering problem and proposes new evaluation "adaptation benchmarks" that test an AI's ability to actively learn in novel scenarios. The conclusion is t...

If someone tells you that AGI (Artificial General Intelligence) has been achieved, how would you determine if they are telling the truth or just bragging?

In the leaked agreement between OpenAI and Microsoft, the yardstick is financial: an AI system capable of generating at least $100 billion in profit qualifies as AGI. For Jensen Huang, the yardstick is time: AGI is inevitable within five years. Musk has repeatedly predicted that it will be achieved "next year."

Industry leaders give such different answers not because anyone is lying, but because there is no universally accepted yardstick for AGI itself. As Bennett, an independent-minded researcher in the AGI field, notes in his paper, hype and speculation have reduced AGI to a "Rorschach inkblot test": everyone sees their own imagination rather than objective fact. Santa Fe Institute scientist Melanie Mitchell likewise believes this debate can only be settled through long-term scientific research. (Paper link: https://arxiv.org/pdf/2503.23923)

This is the most absurd predicament in the current AI industry: we are sprinting at full speed toward a goal whose finish line hasn't even been clearly drawn.

2025: Who Is Redrawing the Starting Line for AGI?

Facing this definition vacuum, academia began intensively "filling the gap" in 2025. Scholars like Bengio emphasized "versatility" and "proficiency"; DeepMind proposed "distributed AGI," attempting to break the myth of a single, all-powerful entity.

However, Australian National University researcher Michael Timothy Bennett, in a paper submitted to arXiv at the end of March, offered a highly provocative yet incisive answer.

He pointed out that previous definitions, for all their variations, still revolve around the benchmark of an "educated adult." Bennett adopted scholar Pei Wang's definition of intelligence, which treats intelligence as the capacity to adapt under limited resources. This steps outside the "human-like" framework altogether and defines AGI as an "artificial scientist."

He proposed that true AGI should be a system capable of adapting widely, efficiently, and scientifically to new environments and tasks, like a human scientist, under real-world constraints such as computation, memory, and energy.

The subtext of this statement is: the criterion for judging AGI should not be how well it imitates humans, but how strong its ability is to "discover new knowledge."

Why is a new yardstick urgently needed? Because the old yardsticks—the Turing test and human benchmark tests—have been aced by large models, yet we are moving farther away from true general intelligence.

In 2025, if you ask a top-tier large model "which is larger, 9.11 or 9.9," it might still confidently tell you that 9.11 is larger because 11 is greater than 9. When solving complex mathematical inequality proofs, even if a large model guesses the correct answer, its reasoning process is often logically flawed.

Bennett pinpointed the root cause: current large models follow the "scale-maximizing approximation" route, using vast amounts of data and computing power to pre-store approximate answers for various tasks in the network weights. Once they encounter out-of-distribution problems, they falter.

More seriously, large models lack "proactive capabilities." They cannot actively conduct experiments to verify hypotheses, autonomously construct causal chains, or make trade-offs between "continued exploration" and "exploiting the known."

Returning to the comparison between 9.11 and 9.9—the large model isn't incapable of arithmetic; it simply hasn't built a causal model for number comparison. It is merely using probabilities to guess the most similar text fragment it has seen.
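The 9.11 versus 9.9 slip can be made concrete with a small sketch. The two function names below are hypothetical and chosen for illustration; the sketch does not claim to show how a language model computes internally. It only shows that the same pair of strings has two different "largest" answers depending on which comparison rule is applied, and that the wrong answer quoted above matches the rule used for software version numbers.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return the larger string when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return the larger string when each dot-separated part is compared
    as a whole integer, the way version numbers are ordered."""
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


print(larger_as_decimal("9.11", "9.9"))  # 9.9  (0.11 < 0.90)
print(larger_as_version("9.11", "9.9"))  # 9.11 (11 > 9)
```

A system that has an explicit model of decimal place value always applies the first rule to this question. A system matching surface patterns may apply either one.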

The chasm between "mimicry ability" and "adaptive ability" is precisely what the new AGI standard aims to measure.

The New Calibration of Intelligence: Deconstructing the "Artificial Scientist"

Bennett's criteria deserve attention because he turned AGI from a vague philosophical proposition into a quantifiable engineering problem.

In his view, a true AGI's behavioral pattern should perfectly align with the research paradigm of a human scientist:

First, from "Puppet on Strings" to "Active Experimenter."

Today's AI is a completely passive learner, only able to "see" the data fed to it by humans. But a scientist is not. If locked in an unfamiliar room, a scientist wouldn't just stand there waiting for information; they would push the door, pull the handle, check the windows—this is "active experimentation." True AGI must be capable of autonomously planning experiments and acquiring key information through proactive interaction.

Second, from "Knowing That" to "Knowing Why."

This is the biggest shortcoming of current AI. Large models are extreme "correlation learners." They know "rain" often accompanies "wet ground," but don't know which causes which. Only a system that understands causality can infer, when the sky is clear but the ground is wet, that a sprinkler truck has passed by rather than concluding that rain is imminent. Without causal understanding, AI can only operate within the distribution of its training data, which has nothing to do with being "general."
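The gap between correlation and causation in the rain example can be sketched with a toy model. Everything here is an assumption made for illustration: the two probabilities are invented, the ground is taken to be wet exactly when it rained or a sprinkler truck passed, and the function names are hypothetical. The sketch compares what passive observation suggests with what happens when someone wets the ground deliberately.

```python
from itertools import product

# Made-up probabilities, for illustration only.
P_RAIN = 0.3       # chance that it rained
P_SPRINKLER = 0.1  # chance that a sprinkler truck passed


def worlds(force_wet=None):
    """Yield (rain, wet, probability) for every case in the toy model.

    With force_wet=None the ground is wet iff it rained or a truck passed.
    With force_wet=True we wet the ground ourselves (an intervention),
    which disconnects wetness from its usual causes.
    """
    for rain, sprinkler in product([True, False], repeat=2):
        p = (P_RAIN if rain else 1 - P_RAIN) * (
            P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
        )
        wet = (rain or sprinkler) if force_wet is None else force_wet
        yield rain, wet, p


def p_rain_given_wet(force_wet=None):
    """Probability of rain among the cases where the ground is wet."""
    wet_cases = [(rain, p) for rain, wet, p in worlds(force_wet) if wet]
    return sum(p for rain, p in wet_cases if rain) / sum(p for _, p in wet_cases)


print(round(p_rain_given_wet(), 3))                # observed wet ground: about 0.81
print(round(p_rain_given_wet(force_wet=True), 3))  # ground wetted by hand: 0.3
```

Seeing wet ground raises the probability of rain well above its prior, which is the correlation. Wetting the ground by hand leaves the probability of rain at its prior, because wet ground does not cause rain. A pure correlation learner has no way to tell these two situations apart.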

Third, Walking the Tightrope Between "Exploration" and "Exploitation."

A system that only explores without exploiting cannot solve immediate problems despite accumulating vast knowledge; one that only exploits without exploring is helpless when the environment changes. AGI must dynamically balance this contradiction under resource constraints, knowing what it doesn't know and allocating computing power accordingly.

Furthermore, Bennett added a highly realistic dimension: energy constraints. Including "energy" in the definition draws a clear bottom line: true intelligence is not about possessing unlimited resources, but about adapting elegantly under limited resources. An AI that needs the output of an entire nuclear power plant to solve a new problem is just an expensive calculator, not AGI.
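The exploration versus exploitation trade-off is commonly illustrated with a multi-armed bandit. The following is a minimal sketch using an epsilon-greedy policy, a standard textbook technique that is swapped in here for illustration and is not taken from Bennett's paper. The function name, the three payoff probabilities, and the default parameters are all assumptions.

```python
import random


def run_epsilon_greedy(true_means, epsilon=0.1, steps=5000, seed=0):
    """Play a Bernoulli bandit with an epsilon-greedy policy.

    With probability epsilon a random arm is tried (exploration);
    otherwise the arm with the best estimated payoff is pulled
    (exploitation). Returns (pull_counts, estimated_means).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical payoff probabilities; the last arm is the best one.
arms = [0.2, 0.5, 0.8]
print(run_epsilon_greedy(arms, epsilon=0.1)[0])  # most pulls go to the best arm
print(run_epsilon_greedy(arms, epsilon=0.0)[0])  # [5000, 0, 0]
```

With `epsilon=0.0` the policy never explores, so it pulls the first arm on every step and never learns that a better arm exists. A small amount of exploration is enough to find the best arm, at the cost of some deliberately suboptimal pulls.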

Route Reset Towards AGI: Farewell to the Singular Scaling Law

Based on the above framework, Bennett deconstructs the current meta-methods for building intelligent systems into three categories:

Scale-maxing: The mainstream route for current large models, frantically stacking parameters, data, and computing power. But the bottleneck is already apparent: extremely low sample and energy efficiency.

Simp-maxing (Simplicity-maximizing): Pursuing the ultimate simplicity of model structure, believing in Occam's razor. But simplicity is a property of form, not function—the "simplest" under different Turing machines may be entirely different, making it difficult to escape the trap of subjectivity.

W-maxing (Constraint-Weakening-maximizing): Weakening functional constraints as much as possible, allowing the system to find optimal solutions on its own. Experiments show that W-maxing alone can achieve 110%-500% generalization rate improvements on specific tasks, but it requires searching an infinite hardware morphology space, making optimization extremely difficult.

Bennett's conclusion is clear: although Scale-maxing is currently dominant, AGI will not be achieved through the brute force of a single route; it must be a fusion of multiple meta-methods.

If the definition of an "artificial scientist" is widely accepted, the AI industry will undergo a profound paradigm shift.

The evaluation criteria will completely change. We no longer need to track how many more points a large model gains on human exam leaderboards. Instead, we need to establish a set of "adaptability benchmarks": throw the AI into an unseen physical environment and see if it can discover patterns with limited interaction; give it a new game and see if it can understand the rules faster than humans; even have it tackle real scientific problems and see if it can autonomously propose hypotheses and design experiments to verify them. The core is no longer "how much you know," but "how much you can discover."

Technical routes will also shift accordingly. The simple Scaling Law will soon hit its ceiling because passively received data cannot supply causal knowledge. Search and approximation, scale maximization and constraint weakening: the achievement of AGI will necessarily be a fusion of various tools and meta-methods, not an extension of a single route.
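What "discovering patterns with limited interaction" might mean as a score can be sketched with a toy environment. This is a hypothetical illustration, not a benchmark from Bennett's paper: the environment hides an integer threshold, each probe costs one interaction, and a learner is scored by how many probes it needs to recover the rule. The class and function names are invented for this sketch.

```python
class HiddenThresholdEnv:
    """Toy environment: probe(x) reports whether x reaches a hidden threshold."""

    def __init__(self, threshold: int, size: int = 1024):
        self.threshold = threshold
        self.size = size
        self.interactions = 0  # the cost the benchmark charges the learner

    def probe(self, x: int) -> bool:
        self.interactions += 1
        return x >= self.threshold


def fixed_order_learner(env: HiddenThresholdEnv) -> int:
    """Checks inputs in a fixed order, ignoring what earlier probes revealed."""
    for x in range(env.size):
        if env.probe(x):
            return x
    return env.size


def active_learner(env: HiddenThresholdEnv) -> int:
    """Chooses each probe based on earlier results (bisection)."""
    lo, hi = 0, env.size
    while lo < hi:
        mid = (lo + hi) // 2
        if env.probe(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


for learner in (fixed_order_learner, active_learner):
    env = HiddenThresholdEnv(threshold=700)
    found = learner(env)
    print(learner.__name__, found, env.interactions)
```

Both learners recover the same rule, but the one that designs its next experiment from previous results needs on the order of ten interactions instead of several hundred. An adaptability benchmark in this spirit would rank systems by that interaction count rather than by stored knowledge.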

The importance of Bennett's paper lies not in providing the ultimate answer to AGI, but in wiping clean a corner of the blurred mirror named "intelligence." He shows us that achieving AGI is not a linear iteration of large models, but a route reset.

What should AGI actually look like? The answer lies not in conversations that increasingly resemble human ones, but in the ability to actively ask "why" and personally verify the answers. When AI truly walks out of the mist of the "Rorschach inkblot test," it will no longer merely mimic human appearance but possess the spirit of a scientist.

(This article was first published on Titanium Media APP. Author: Silicon Valley Tech News. Editor: Zhao Hongyu)
