Behind the AI Report Card Lies a Chinese 'Exam Setter'

marsbit, published 2026-06-20, updated 2026-06-20

Article Summary

Beyond the familiar performance charts like MMLU-Pro and MMMU, which major AI models strive to ace, stands a key "examiner": Chinese-Canadian researcher Wenhu Chen. An assistant professor at the University of Waterloo and founder of TIGERLab, Chen addresses the crucial need for more rigorous AI evaluation. As models like GPT-4 began scoring near-perfect results on older benchmarks like MMLU, it became difficult to distinguish their true capabilities. In response, Chen introduced MMLU-Pro in 2024, featuring harder, more reasoning-focused questions with more answer choices, successfully reintroducing meaningful performance gaps. His work extends to multi-modal evaluation with MMMU and its enhanced version, MMMU-Pro. These benchmarks test a model's ability to understand and reason with complex information from images, charts, and text across diverse academic subjects, exposing the significant challenges even top models face in genuine comprehension. Chen's background in complex QA, table reasoning, and his experience at Google DeepMind on projects like Gemini inform his approach. He understands that effective benchmarks must anticipate how models might "cheat" by memorizing data or avoiding visual analysis. His lab also actively researches video understanding and generation models (e.g., UniVideo, Vamba), ensuring his evaluation work is grounded in practical model-building challenges. Now at Meta's Super Intelligence Lab, Chen continues his focus on multi-modal data and evalua...

Each time a cutting-edge model is released, the AI community focuses on a few familiar report cards.

MMLU-Pro, MMMU, MMMU-Pro... These names might sound foreign to ordinary users, but for model companies and researchers, they have almost become the "standard subjects." GPT, Claude, Gemini, Llama, Qwen, and DeepSeek all keep submitting their answers on these benchmarks.

"The proof is in the pudding." How good a model is often needs to be proven by these scores.

Many performance comparison charts in model launch presentations rely on them; some leaderboards on HuggingFace are also built upon these evaluation systems. It could even be said that today, when the AI industry discusses model capabilities, it is already using a common language defined by these benchmarks.

But interestingly, almost everyone focuses on the scores, yet few know who sets the questions. Behind MMLU-Pro, MMMU, and MMMU-Pro, the same name can be seen—Wenhu Chen.

He is an Assistant Professor in the Department of Computer Science at the University of Waterloo in Canada. On Google Scholar, his papers have been cited over 30,000 times.

He is also the founder of TIGERLab. The lab's full English name is Text and Image GEnerative Research Lab. Because his name contains the Chinese word for "tiger," Wenhu Chen also gave the lab a distinctive Chinese name: Hutou Bang (Tiger Head Gang).

01 After the Old Exam Papers Lost Their Effectiveness

Wenhu Chen first caught wider attention because of MMLU-Pro.

MMLU was once one of the most commonly used benchmark evaluations for assessing the capabilities of large language models. It was like a comprehensive test paper, covering multiple subjects, used to measure a model's performance in knowledge understanding and reasoning tasks.

Early on, this paper was very useful. It could distinguish between models through scores, and the industry could also use it to observe whether large language models were truly improving.

But problems soon emerged.

As model capabilities continuously improved, MMLU gradually became "insufficiently challenging." The scores of cutting-edge models got higher and higher, and the gaps between them became smaller and smaller.

After OpenAI released o3, this problem became even more apparent. o3's accuracy on MMLU was already close to 100%, and other cutting-edge models followed with scores approaching full marks.

This might sound like good news, but for evaluation, it actually meant trouble.

If everyone can get close to full marks on an exam paper, it becomes very difficult to continue judging who is stronger and where their strengths lie. It can still prove that models possess certain capabilities, but it is no longer suitable for measuring new progress.

The AI industry needed a harder, less easily "fooled" exam paper.

In 2024, Wenhu Chen and his team launched MMLU-Pro.

MMLU-Pro revamped this exam paper rather than simply expanding the question bank.

It contains 12,032 questions covering 14 fields, including mathematics, physics, chemistry, law, engineering, psychology, and health. Compared to the original MMLU, it expands the answer options from 4 to 10, reducing the probability that a model guesses correctly. It also adds more reasoning-oriented questions and removes questions from the original bank that were too simple, ambiguous, or poor at discriminating between models.
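The effect of widening the answer options can be made concrete with a small calculation. The following is a minimal sketch, assuming every question has exactly one correct answer and that a model guesses uniformly at random; the function name `chance_accuracy` is made up for illustration and is not part of any benchmark toolkit.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a
    single-answer multiple-choice question (illustrative helper)."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Assumption: 4 options per question in the original MMLU,
# 10 options per question in MMLU-Pro, as described in the text.
mmlu_baseline = chance_accuracy(4)
mmlu_pro_baseline = chance_accuracy(10)

print(f"MMLU chance accuracy:     {float(mmlu_baseline):.0%}")
print(f"MMLU-Pro chance accuracy: {float(mmlu_pro_baseline):.0%}")
```

Under these assumptions, the floor that pure guessing can reach falls from one in four to one in ten, so a given score reflects less luck.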

The effect was direct.

The paper's results showed that model accuracy on MMLU-Pro dropped by 16% to 33% compared to the original MMLU. When the same model was tested under 24 different prompt styles, its score variation also shrank, from 4% to 5% on the original MMLU to about 2% on MMLU-Pro.

In other words, this new paper is not only harder but also more stable.

It reopened the gaps between models that all seemed excellent on the old exam paper. It also became easier to tell whether a model truly understands reasoning or is just better at handling old-style questions.

02 Usable Benchmark Evaluations

MMLU-Pro was quickly adopted by the industry.

MMLU-Pro was accepted to the NeurIPS 2024 Datasets and Benchmarks track and was also integrated into EleutherAI's lm-evaluation-harness framework. For the open-source model community, this meant it was no longer just a dataset in a paper but had entered the common evaluation toolchain.

Many models began reporting MMLU-Pro scores upon release. Some leaderboards on HuggingFace also incorporated it into their evaluation systems.

If MMLU-Pro solved the problem of the "old exam paper losing effectiveness" in language model evaluation, then MMMU pushed Wenhu Chen and TIGERLab to the center of multimodal evaluation.

The problems with multimodal models are more complex.

When language models answer questions, they mainly process text. Multimodal models, however, must simultaneously handle information in many forms: images, charts, diagrams, maps, tables, musical scores, chemical structures, and more. They need not only to understand the question stem but also to truly comprehend the content of the images, and to reason by integrating visual information, textual information, and domain knowledge.

The MMMU benchmark contains 11,500 multimodal questions sourced from university exams, quizzes, and textbooks, covering six major domains: Arts & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Technology & Engineering, further subdivided into 30 subjects and 183 subfields.

These questions are not simply asking the model "what's in the picture." They require the model to combine image information with domain knowledge, much like a student tackling a professional problem.

When MMMU was released, the research team tested 14 open-source multimodal models, as well as representative closed-source models like GPT-4V and Gemini Ultra. Even these two, the strongest closed-source models at the time, achieved accuracy rates of only 56% and 59% respectively.

These numbers indicate that while multimodal models appear to be progressing rapidly, they still have significant room for improvement when it comes to problems requiring genuine professional understanding and reasoning.

Later, Wenhu Chen's team released MMMU-Pro, further plugging the gaps that allowed models to bypass visual information. It filters out questions that could be answered by text-only models, expands answer choices, and introduces a vision-only setting where questions are embedded within images, requiring the model to perform both visual reading and text comprehension simultaneously.

Simply put, it prevents the model from "guessing the answer just by looking at the text."
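The filtering idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `Question` record, its `text_only_correct` field, the model names, and the threshold are all hypothetical, and the actual procedure used to build MMMU-Pro may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Question:
    qid: str
    # Hypothetical field: for each text-only model, whether it answered
    # correctly WITHOUT being shown the image.
    text_only_correct: Dict[str, bool]


def needs_image(q: Question, max_text_only_solvers: int = 0) -> bool:
    """Keep a question only if at most `max_text_only_solvers`
    text-only models solve it without the image."""
    return sum(q.text_only_correct.values()) <= max_text_only_solvers


def filter_questions(questions: List[Question],
                     max_text_only_solvers: int = 0) -> List[Question]:
    """Drop questions that too many text-only models can already answer."""
    return [q for q in questions if needs_image(q, max_text_only_solvers)]


# Made-up records for illustration only, not data from the benchmark.
questions = [
    Question("q1", {"model_a": True, "model_b": True}),    # solvable from text
    Question("q2", {"model_a": False, "model_b": False}),  # needs the image
    Question("q3", {"model_a": True, "model_b": False}),   # borderline
]

kept = filter_questions(questions, max_text_only_solvers=1)
print([q.qid for q in kept])  # → ['q2', 'q3']
```

The threshold is a design choice: a strict setting removes any question that a single text-only model gets right, while a looser one tolerates occasional lucky guesses.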

This kind of work might sound somewhat tedious, but it is crucial. Future multimodal models will need to enter scenarios like healthcare, education, scientific research, design, and engineering, where merely being able to describe a picture is not enough. They must be able to judge, reason, explain, and find the truly useful parts within complex visual information.

03 The People Behind the "Exam Papers"

Wenhu Chen's later work on MMLU-Pro and MMMU stems from his long-standing research direction.

His research interests have always been related to complex information understanding, knowledge question answering, and reasoning.

He earned his bachelor's degree from Huazhong University of Science and Technology, then pursued a master's at RWTH Aachen University in Germany, followed by a Ph.D. in Computer Science from the University of California, Santa Barbara. During his Ph.D., he had already begun research in areas like complex question answering, table reasoning, and knowledge evidence localization.

These tasks share a common characteristic: the answer often does not lie within a single piece of text.

It might be hidden in a table, require combining a piece of text and an image, or might need the model to first retrieve information, then integrate, calculate, and reason. The model cannot just be good at reciting existing knowledge.

Projects Wenhu Chen participated in, such as HybridQA, TabFact, Program of Thoughts, and MAmmoTH, are all related to this line of work.

This also explains his sensitivity to loopholes in model evaluation.

A good benchmark evaluation is not simply about making questions increasingly difficult, but about anticipating where models are most likely to "guess correctly" or "appear competent."

A model might memorize the question bank, guess answers based on the options, or use text to bypass visual information... A good benchmark needs to close these loopholes.

After his Ph.D., Wenhu Chen joined Google Research and later participated in the development and evaluation of Google DeepMind's Gemini multimodal model from 2021 to 2025. This experience was also important. Long-term exposure to cutting-edge model development gave him a clearer understanding of how model capabilities grow and made it easier to see potential biases and blind spots in evaluation.

In the fall of 2022, Wenhu Chen joined the David R. Cheriton School of Computer Science at the University of Waterloo as an Assistant Professor. The same year, he was selected as a Canada CIFAR AI Chair. Subsequently, he founded "TIGERLab" (aka Hutou Bang), continuing research focused on foundation models, multimodal capabilities, and benchmark evaluations.

Hutou Bang doesn't just work on benchmark evaluations; they also conduct model and system research.

In the video direction, UniVideo attempts to place video understanding, generation, and editing within the same framework, allowing the model not only to generate a sequence of frames but also to understand content, respond to instructions, and complete edits. Vamba targets long video understanding, addressing the memory, computation, and training efficiency challenges posed by hour-long videos. MoCha, a collaboration with Meta's Generative AI team, focuses on talking virtual character generation, producing high-quality character videos from voice and text descriptions.

An exam setter who never sits an exam cannot set good questions. Building models firsthand, in turn, makes the lab better suited to evaluating them.

Truly good evaluation often comes from understanding the boundaries of model capability. Only by knowing how models are built and what problems they encounter in real tasks can one design questions that differentiate performance and expose weaknesses.

Now, Wenhu Chen has joined Meta's Superintelligence Lab, where his work continues to focus on multimodal pretraining data and evaluation, serving Meta's foundation models.

The AI industry does not lack visible figures. Typically, the spotlight falls on entrepreneurs, star researchers, and heads of large model companies. New product launches, funding news, open-source models, and team adjustments often attract the most external attention, making these names more visible to the public.

But today, the participation of Chinese talent in the AI field extends far beyond these most conspicuous positions.

This article is from the WeChat public account "Letters AI", author: Jin Ya
