Behind the AI Report Card Lies a Chinese 'Exam Setter'

marsbit | Published on 2026-06-20 | Last updated on 2026-06-20

Abstract

Beyond the familiar performance charts like MMLU-Pro and MMMU, which major AI models strive to ace, stands a key "examiner": Chinese-Canadian researcher Wenhu Chen. An assistant professor at the University of Waterloo and founder of TIGERLab, Chen addresses the crucial need for more rigorous AI evaluation. As models like GPT-4 began scoring near-perfect results on older benchmarks like MMLU, it became difficult to distinguish their true capabilities. In response, Chen introduced MMLU-Pro in 2024, featuring harder, more reasoning-focused questions with more answer choices, successfully reintroducing meaningful performance gaps. His work extends to multi-modal evaluation with MMMU and its enhanced version, MMMU-Pro. These benchmarks test a model's ability to understand and reason with complex information from images, charts, and text across diverse academic subjects, exposing the significant challenges even top models face in genuine comprehension. Chen's background in complex QA, table reasoning, and his experience at Google DeepMind on projects like Gemini inform his approach. He understands that effective benchmarks must anticipate how models might "cheat" by memorizing data or avoiding visual analysis. His lab also actively researches video understanding and generation models (e.g., UniVideo, Vamba), ensuring his evaluation work is grounded in practical model-building challenges. Now at Meta's Super Intelligence Lab, Chen continues his focus on multi-modal data and evalua...

Each time a cutting-edge model is released, the AI community focuses on a few familiar report cards.

MMLU-Pro, MMMU, MMMU-Pro... These names may sound unfamiliar to ordinary users, but for model companies and researchers they have almost become the "standard subjects." GPT, Claude, Gemini, Llama, Qwen, and DeepSeek keep submitting their answers on these benchmarks.

"The proof is in the pudding." How good a model is often needs to be proven by these scores.

Many performance comparison charts in model launch presentations rely on them; some leaderboards on HuggingFace are also built upon these evaluation systems. It could even be said that when the AI industry discusses model capabilities today, it is already speaking a common language defined by these benchmarks.

But interestingly, almost everyone focuses on the scores, yet few know who sets the questions. Behind MMLU-Pro, MMMU, and MMMU-Pro, the same name can be seen—Wenhu Chen.

He is an Assistant Professor in the Department of Computer Science at the University of Waterloo in Canada. On Google Scholar, his papers have been cited over 30,000 times.

He is also the founder of TIGERLab, whose full English name is Text and Image GEnerative Research Lab. Because his given name contains the Chinese character for "tiger" (hu), Wenhu Chen also gave the lab a distinctive Chinese name: Hutou Bang (Tiger Head Gang).

01 After the Old Exam Papers Lost Their Effectiveness

Wenhu Chen first caught wider attention because of MMLU-Pro.

MMLU was once one of the most commonly used benchmark evaluations for assessing the capabilities of large language models. It was like a comprehensive test paper, covering multiple subjects, used to measure a model's performance in knowledge understanding and reasoning tasks.

Early on, this paper was very useful. It could distinguish between models through scores, and the industry could also use it to observe whether large language models were truly improving.

But problems soon emerged.

As model capabilities continuously improved, MMLU gradually became "insufficiently challenging." The scores of cutting-edge models got higher and higher, and the gaps between them became smaller and smaller.

After OpenAI released GPT-4, this problem became even more apparent. GPT-4's accuracy on MMLU was already approaching full marks, and other cutting-edge models soon posted similarly high scores.

This might sound like good news, but for evaluation, it actually meant trouble.

If everyone can get close to full marks on an exam paper, it becomes very difficult to continue judging who is stronger and where their strengths lie. It can still prove that models possess certain capabilities, but it is no longer suitable for measuring new progress.

The AI industry needed a harder, less easily "fooled" exam paper.

In 2024, Wenhu Chen and his team launched MMLU-Pro.

MMLU-Pro revamped this exam paper rather than simply expanding the question bank.

It contains 12,032 questions, covering 14 fields including mathematics, physics, chemistry, law, engineering, psychology, and health. Compared to the original MMLU, it expands the options from 4 to 10, reducing the probability of models guessing correctly. It also incorporates more reasoning-oriented questions and cleans up the original question bank of questions that were relatively simple, ambiguous, or lacked sufficient discriminative power.
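The effect of the larger option set on blind guessing is simple arithmetic. As a minimal sketch, assuming every question has exactly one correct option and a model that guesses uniformly at random (the function name `random_guess_accuracy` is ours for illustration, not something from the paper):

```python
from fractions import Fraction


def random_guess_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a
    single-answer multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Compare the original 4-option format with the 10-option format.
for n in (4, 10):
    print(f"{n} options: {float(random_guess_accuracy(n)):.0%} by chance")
```

Under those assumptions the chance baseline falls from 25% to 10%, so a score on the ten-option paper owes less to luck.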

The effect was direct.

The paper's results showed that model accuracy on MMLU-Pro decreased by 16% to 33% compared to the original MMLU. When the same model was tested under 24 different prompt styles, the score variation also decreased from 4% to 5% in the original MMLU to about 2%.

In other words, this new paper is not only harder but also more stable.

It reopened the gaps between models that all seemed excellent on the old exam paper. It also became easier to tell whether a model truly understands reasoning or is just better at handling old-style questions.
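One way to picture the stability claim is to score the same model under several prompt styles and see how far the results spread. The sketch below is only an illustration: the `prompt_spread` helper and the choice of range and standard deviation as the measure are assumptions, not the paper's exact definition, and the numbers are invented.

```python
from statistics import pstdev


def prompt_spread(accuracies: list[float]) -> dict[str, float]:
    """Summarize how much one model's accuracy moves across prompt styles."""
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    return {
        # Gap between the best and worst prompt style.
        "range": max(accuracies) - min(accuracies),
        # Population standard deviation over all prompt styles.
        "stdev": pstdev(accuracies),
    }


# Invented numbers for illustration only, not results from the paper.
toy_scores = [0.71, 0.73, 0.72, 0.70]
print(prompt_spread(toy_scores))
```

A smaller spread means the ranking of models depends less on how the question happens to be phrased.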

02 Usable Benchmark Evaluations

MMLU-Pro was quickly adopted by the industry.

MMLU-Pro later entered the NeurIPS 2024 Datasets and Benchmarks track and was also integrated into EleutherAI's lm-evaluation-harness framework. For the open-source model community, this meant it was no longer just a dataset in a paper but had entered the common evaluation toolchain.

Many models began reporting MMLU-Pro scores upon release. Some leaderboards on HuggingFace also incorporated it into their evaluation systems.

If MMLU-Pro solved the problem of the "old exam paper losing effectiveness" in language model evaluation, then MMMU pushed Wenhu Chen and TIGERLab to the center of multimodal evaluation.

The problems with multimodal models are more complex.

Language models mainly process text when answering questions. Multimodal models, however, have to process information in many forms at once: images, charts, diagrams, maps, tables, musical scores, chemical structures, and more. They need not only to understand the question stem but also to truly comprehend the content of the images, and to reason by integrating visual information, textual information, and domain knowledge.

The MMMU benchmark contains 11,500 multimodal questions sourced from university exams, quizzes, and textbooks, covering six major domains: Arts & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Technology & Engineering, further subdivided into 30 subjects and 183 subfields.

These questions are not simply asking the model "what's in the picture." They require the model to combine image information with domain knowledge, much like a student tackling a professional problem.

When MMMU was released, the research team tested 14 open-source multimodal models, as well as representative closed-source models like GPT-4V and Gemini Ultra. Even the strongest closed-source models at the time, GPT-4V and Gemini Ultra, only achieved accuracy rates of 56% and 59% respectively.

These numbers indicate that while multimodal models appear to be progressing rapidly, they still have significant room for improvement when it comes to problems requiring genuine professional understanding and reasoning.

Later, Wenhu Chen's team released MMMU-Pro, further plugging the gaps that allowed models to bypass visual information. It filters out questions that could be answered by text-only models, expands answer choices, and introduces a vision-only setting where questions are embedded within images, requiring the model to perform both visual reading and text comprehension simultaneously.

Simply put, it prevents the model from "guessing the answer just by looking at the text."
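The filtering idea can be sketched in a few lines. This is a hedged illustration, not the MMMU-Pro pipeline itself: the `Question` class, the `predictions` mapping, and the `max_correct` threshold are all hypothetical, and the paper's actual rule for deciding that a question is answerable from text alone may differ.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    answer: str  # gold option letter, e.g. "A"


def filter_vision_dependent(
    questions: list[Question],
    predictions: dict[str, list[str]],
    max_correct: int = 0,
) -> list[Question]:
    """Keep questions that text-only models mostly fail.

    predictions maps a question id to the option letters chosen by
    several text-only models that never saw the image. A question is
    kept when at most max_correct of those models got it right.
    Questions with no predictions are kept by default.
    """
    kept = []
    for q in questions:
        guesses = predictions.get(q.qid, [])
        correct = sum(1 for g in guesses if g == q.answer)
        if correct <= max_correct:
            kept.append(q)
    return kept


# Toy data: q1 is solved by two of three text-only models, q2 by none.
questions = [Question("q1", "A"), Question("q2", "B")]
predictions = {"q1": ["A", "A", "C"], "q2": ["C", "D", "A"]}
kept = filter_vision_dependent(questions, predictions, max_correct=1)
print([q.qid for q in kept])
```

In this toy run only `q2` survives, because `q1` can be answered without looking at the picture.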

This kind of work might sound somewhat tedious, but it is crucial, because future multimodal models need to enter scenarios like healthcare, education, scientific research, design, and engineering, where merely being able to describe a picture is not enough. They must be able to judge, reason, explain, and find the truly useful parts within complex visual information.

03 The People Behind the "Exam Papers"

Wenhu Chen's later work on MMLU-Pro and MMMU stems from his long-standing research direction.

His research interests have always been related to complex information understanding, knowledge question answering, and reasoning.

He earned his bachelor's degree from Huazhong University of Science and Technology, then pursued a master's at RWTH Aachen University in Germany, followed by a Ph.D. in Computer Science from the University of California, Santa Barbara. During his Ph.D., he had already begun research in areas like complex question answering, table reasoning, and knowledge evidence localization.

These tasks share a common characteristic: the answer often does not lie within a single piece of text.

It might be hidden in a table, require combining a piece of text and an image, or might need the model to first retrieve information, then integrate, calculate, and reason. The model cannot just be good at reciting existing knowledge.

Projects Wenhu Chen participated in, such as HybridQA, TabFact, Program of Thoughts, and MAmmoTH, are all related to this line of work.

This also explains his sensitivity to loopholes in model evaluation.

A good benchmark evaluation is not simply about making questions increasingly difficult, but about anticipating where models are most likely to "guess correctly" or "appear competent."

A model might memorize the question bank, guess answers based on options, or use text to bypass visual information... Good evaluation needs to patch these loopholes well.

After his Ph.D., Wenhu Chen joined Google Research and later participated in the development and evaluation of Google DeepMind's Gemini multimodal model from 2021 to 2025. This experience was also important. Long-term exposure to cutting-edge model development gave him a clearer understanding of how model capabilities grow and made it easier to see potential biases and blind spots in evaluation.

In the fall of 2022, Wenhu Chen joined the David R. Cheriton School of Computer Science at the University of Waterloo as an Assistant Professor. The same year, he was selected as a Canada CIFAR AI Chair. Subsequently, he founded "TIGERLab" (aka Hutou Bang), continuing research focused on foundation models, multimodal capabilities, and benchmark evaluations.

Hutou Bang doesn't just work on benchmark evaluations; they also conduct model and system research.

In the video direction, UniVideo attempts to place video understanding, generation, and editing within the same framework, allowing the model not only to generate a sequence of frames but also to understand content, respond to instructions, and complete edits. Vamba targets long video understanding, addressing the memory, computation, and training efficiency challenges posed by hour-long videos. MoCha, a collaboration with Meta's Generative AI team, focuses on talking virtual character generation, producing high-quality character videos from voice and text descriptions.

An exam setter who never sits exams will struggle to set good questions. Building models themselves, in turn, makes the team better suited to evaluating them.

Truly good evaluation often comes from an understanding of where model capabilities end. Only by knowing how models are built and what problems they encounter in real tasks can one design questions that differentiate performance and expose weaknesses.

Now, Wenhu Chen has joined Meta's Superintelligence Lab, where his work continues to focus on multimodal pretraining data and evaluation, serving Meta's foundation models.

The AI industry does not lack visible figures. Typically, the spotlight falls on entrepreneurs, star researchers, and heads of large model companies. New product launches, funding news, open-source models, and team adjustments often attract the most external attention, making these names more visible to the public.

But today, the participation of Chinese talent in the AI field extends far beyond these most conspicuous positions.

This article is from the WeChat public account "Letters AI", author: Jin Ya
