Behind the AI Scorecards Lies a Chinese 'Question Setter'

marsbitPublished on 2026-06-19Last updated on 2026-06-19

Abstract

Behind the AI scorecards that dominate industry discussions—benchmarks like MMLU-Pro, MMMU, and MMMU-Pro—stands a Chinese-Canadian researcher: Wenhu Chen. As an assistant professor at the University of Waterloo and founder of the TIGER Lab, Chen has become a key "exam-setter" for evaluating large language and multimodal models. Chen first gained broader recognition with MMLU-Pro, a more challenging and stable update to the popular MMLU benchmark. As top models like OpenAI’s o3 began achieving near-perfect scores on the original MMLU, it became difficult to distinguish their true capabilities. MMLU-Pro introduced more complex reasoning questions, expanded answer choices, and filtered out ambiguous or simple items, effectively reintroducing differentiation among state-of-the-art models. His work on MMMU addressed the evaluation of multimodal models, requiring them to integrate visual information (like charts, diagrams, or tables) with textual knowledge across diverse academic subjects. Even the strongest models initially scored only around 56-59%, highlighting significant room for improvement in genuine multimodal reasoning. MMMU-Pro further refined this by preventing models from bypassing visual cues. Chen’s research focus has long been on complex information understanding and reasoning. His background—including a PhD at UC Santa Barbara, research at Google/DeepMind on Gemini, and now a role in Meta’s superintelligence lab—provides deep insight into model development and th...

By | Zimu AI

With every release of a frontier model, the AI community fixates on a few familiar scorecards.

MMLU-Pro, MMMU, MMMU-Pro... While these names might be unfamiliar to the average user, for model companies and researchers, they have essentially become "standard subjects." GPT, Claude, Gemini, Llama, Qwen, DeepSeek, and others continually submit their answers on these benchmarks.

"Put it to the test" - a model's performance often hinges on these scores for proof.

Many performance comparison charts in model launch presentations rely on them; some leaderboards on HuggingFace are also built upon these evaluation systems. It could even be said that when discussing model capabilities today, the AI industry is using a common language largely defined by these benchmarks.

Interestingly, while almost everyone focuses on the scores, few know who the question setters are. And behind MMLU-Pro, MMMU, and MMMU-Pro, one can find the same name—Wenhu Chen.

He is an Assistant Professor in the Computer Science Department at the University of Waterloo in Canada. On Google Scholar, his papers have been cited over 30,000 times.

He is also the founder of the "TIGER Lab" - the Text and Image GEnerative Research Lab. Because the name contains the Chinese character for "tiger" (虎, Hu), Chen Wenhu gave it a highly recognizable Chinese name—虎头帮 (Hutou Bang, Tiger Head Gang).

After the Old Exam Paper Fails

Chen Wenhu first caught wider attention because of MMLU-Pro.

MMLU was once one of the most commonly used benchmark evaluations for assessing the capabilities of large language models. It resembled a comprehensive test paper, covering multiple subjects, used to measure a model's performance in knowledge understanding and reasoning tasks.

In the early days, this paper was very useful. The scores could distinguish between models, and the industry could observe through it whether LLMs were genuinely improving.

But problems soon emerged.

As model capabilities continuously improved, MMLU gradually became "inadequate." The scores of frontier models got higher and higher, and the gaps between them grew smaller and smaller.

The issue became even more pronounced after OpenAI released o3. o3's accuracy on MMLU approached 100%, and other frontier models also subsequently submitted near-perfect scores.

This sounds like good news, but for evaluation purposes, it actually spells trouble.

If everyone scores close to full marks on an exam paper, it becomes difficult to continue judging who is stronger and where their strengths lie. It can still prove that models possess certain capabilities but is no longer suitable for measuring new progress.

The AI industry needed a harder, less "cheatable" exam paper.

In 2024, Chen Wenhu and his team introduced MMLU-Pro.

MMLU-Pro revamped this exam paper rather than simply expanding the question bank.

It contains 12,032 questions, covering 14 fields including mathematics, physics, chemistry, law, engineering, psychology, and health. Compared to the original MMLU, it expanded the multiple-choice options from 4 to 10, reducing the probability of models guessing correctly. It also incorporated more reasoning-oriented questions and filtered out relatively simple, ambiguous, or poorly discriminative questions from the original bank.

The effect was direct.

Paper results showed that model accuracy on MMLU-Pro decreased by 16% to 33% compared to the original MMLU. When testing the same model with 24 different prompt styles, score fluctuation also decreased from 4% to 5% on the original MMLU to about 2%.

In other words, this new paper is not only harder but also more stable.

It re-established gaps between models that all seemed excellent on the old exam paper. It also became easier to discern whether a model truly understands reasoning or is merely better at handling old-style questions.

Useful Benchmark Evaluations

The industry soon adopted MMLU-Pro.

MMLU-Pro subsequently entered the NeurIPS 2024 Datasets and Benchmarks Track and was integrated into EleutherAI's language model evaluation framework, lm-evaluation-harness. For the open-source model community, this meant it was no longer just a dataset in a paper but had entered the common evaluation toolchain.

Many model releases began reporting MMLU-Pro scores. Some leaderboards on HuggingFace also incorporated it into their evaluation systems.

If MMLU-Pro solved the problem of the "old exam paper failing" in language model evaluation, then MMMU propelled Chen Wenhu and TIGER Lab to the center of multimodal evaluation.

The problem with multimodal models is more complex.

Language models answer questions, primarily processing text. Multimodal models must simultaneously handle information in various forms: images, charts, diagrams, maps, tables, musical scores, chemical structures, etc. It's not just about understanding the question stem; it must truly comprehend the content within the images and integrate visual information, textual information, and subject knowledge for reasoning.

The MMMU benchmark contains 11,500 multimodal questions sourced from university exams, quizzes, and textbooks. It covers six major domains: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Technology & Engineering, further subdivided into 30 subjects and 183 subfields.

These questions don't simply ask the model "what's in the picture"; they require the model to combine image information with subject knowledge, much like a student solving a professional problem.

When MMMU was released, the research team tested 14 open-source multimodal models, as well as representative closed-source models like GPT-4V and Gemini Ultra. Even the strongest closed-source models at the time, GPT-4V and Gemini Ultra, only achieved accuracy rates of 56% and 59% respectively.

These numbers indicate that while multimodal models appear to be advancing rapidly, they still have substantial room for improvement on problems requiring genuine professional understanding and reasoning.

Later, Chen Wenhu's team launched MMMU-Pro, further closing avenues for models to bypass visual information. It filtered out questions that text-only models could also answer, expanded answer choices, and introduced a vision-only setting, embedding the question within the image, requiring the model to perform both visual reading and text comprehension simultaneously.

Simply put, it prevents the model from "guessing the answer by only reading the text."

This kind of work might sound somewhat tedious, but it's crucial. Because multimodal models will enter scenarios like healthcare, education, scientific research, design, and engineering in the future, merely describing images is insufficient. They must be capable of judgment, reasoning, explanation, and identifying truly useful information within complex visual data.

The Person Behind the "Exam Papers"

Chen Wenhu's later work on MMLU-Pro and MMMU stemmed from his long-standing research focus.

His research interests have always been related to complex information understanding, knowledge question answering, and reasoning.

He earned his bachelor's degree from Huazhong University of Science and Technology, then pursued a master's at RWTH Aachen University in Germany, and obtained his Ph.D. in Computer Science from the University of California, Santa Barbara. During his Ph.D., he was already conducting research in areas like complex QA, table reasoning, and knowledge evidence localization.

These tasks share a common characteristic: the answer is often not found within a single piece of text.

It might be hidden within a table, require combining a passage of text and an image, or necessitate the model to first retrieve information, then integrate, calculate, and reason. The model cannot merely recite existing knowledge.

Projects Chen Wenhu has been involved in, such as HybridQA, TabFact, Program of Thoughts, and MAmmoTH, are all related to this line of work.

This also explains his sensitivity to loopholes in model evaluation.

A good benchmark evaluation is not simply about making questions increasingly difficult; it's about anticipating where models are most likely to "guess correctly" or "appear competent."

A model might memorize the question bank, guess answers based on options, or use text to circumvent visual information... A good evaluation must patch these vulnerabilities.

After completing his Ph.D., Chen Wenhu joined Google Research and later worked on Google DeepMind's Gemini multimodal model and evaluation from 2021 to 2025. This experience was also significant. Long-term exposure to frontier model development gave him a clearer understanding of how model capabilities grow and made it easier to spot potential biases and blind spots in evaluation.

In the fall of 2022, Chen Wenhu joined the School of Computer Science at the University of Waterloo as an Assistant Professor. That same year, he was selected as a Canada CIFAR AI Chair. Subsequently, he founded the "TIGER Lab (aka Hutou Bang)" and continued research around foundation models, multimodal capabilities, and benchmark evaluations.

Hutou Bang doesn't just work on benchmark evaluations; it also conducts model and systems research.

In the video domain, UniVideo attempts to place video understanding, generation, and editing within a single framework, enabling the model not only to generate footage but also to understand content, respond to instructions, and complete edits. Vamba targets long video understanding, addressing memory, computation, and training efficiency issues posed by hour-long videos. MoCha, developed in collaboration with Meta's Generative AI team, focuses on talking virtual character generation, producing high-quality human videos from audio and textual descriptions.

A question setter who never solves problems themselves cannot create good questions. Working on models themselves conversely makes them more suitable for evaluation.

Because truly good evaluation often stems from an understanding of model capability boundaries. Only by knowing how models are built and the problems they encounter in real-world tasks is it easier to design questions that can measure differences and expose issues.

Currently, Chen Wenhu has joined Meta's Superalignment Lab, where his work continues to focus on multimodal pre-training data and evaluation, serving Meta's foundational models.

The AI industry is not short of visible figures. Spotlights typically shine on entrepreneurs, star researchers, and leaders of major model companies. New product launches, funding news, open-source models, and team changes often attract the most external attention, making these names more likely to enter the public eye.

But the involvement of Chinese talent in today's AI field extends far beyond these most prominent positions.

Trending Cryptos

Related Questions

QWho is the 'question setter' behind benchmark evaluations like MMLU-Pro and MMMU, and what is his main contribution?

AThe 'question setter' is Wenhu Chen (Chen Wenhu), an assistant professor at the University of Waterloo. His main contribution is creating and leading the development of influential benchmark evaluations like MMLU-Pro and MMMU/MMMU-Pro, which have become standard tests for evaluating the reasoning and multimodal abilities of large AI models.

QWhy was MMLU-Pro created, and how did it improve upon the original MMLU benchmark?

AMMLU-Pro was created because the original MMLU benchmark became less effective as top models like OpenAI's o3 achieved near-perfect scores, making it hard to distinguish their capabilities. MMLU-Pro improved upon it by expanding answer choices from 4 to 10 to reduce guessing, adding more reasoning-oriented questions, and filtering out simpler or ambiguous questions. This made the test harder (causing accuracy drops of 16-33%) and more stable, effectively differentiating model performance.

QWhat are the key features and purpose of the MMMU benchmark for multimodal AI models?

AThe MMMU benchmark is designed to rigorously evaluate multimodal AI models. Its key features include: 11,500 multimodal questions from academic sources, coverage of 6 major domains and 30 subjects, and questions that require combining visual information (like charts, diagrams, maps) with domain knowledge for reasoning. Its purpose is to test true multimodal understanding and complex reasoning, not just image description. Even top models like GPT-4V initially scored only around 56% accuracy, highlighting significant room for improvement.

QHow does Wenhu Chen's research background and experience contribute to his work on AI benchmarks?

AWenhu Chen's research background in complex information understanding, knowledge QA, and reasoning (e.g., HybridQA, TabFact) gives him a keen eye for how models might 'cheat' or find shortcuts in evaluations. His experience at Google DeepMind working on Gemini provided insider knowledge of model development and evaluation pitfalls. Furthermore, his own lab, TIGERLab, also builds models (like UniVideo for video tasks). This combination—understanding model creation, real-world tasks, and potential evaluation loopholes—makes him particularly adept at designing robust benchmarks that expose true capability gaps.

QWhat is the TIGERLab (or 'Tiger Gang'), and what kind of work does it do besides creating benchmarks?

ATIGERLab (Text and Image GEnerative Research Lab), nicknamed 'Tiger Gang,' is Wenhu Chen's research lab at the University of Waterloo. Besides creating benchmarks, it conducts research on building AI models and systems. Key projects include UniVideo (a unified framework for video understanding, generation, and editing), Vamba (for long video understanding), and MoCha (in collaboration with Meta, for generating talking virtual avatars). This hands-on model development work informs their benchmark design, ensuring the evaluations are grounded in real technical challenges.

Related Reads

No Sales Team, $20 Million in Revenue: How Did AI Employee Viktor Win Over 30,000 Companies?

The AI employee Viktor, developed by a team with DeepMind background, has achieved $20 million in annual revenue without a traditional sales team, serving over 30,000 companies. Its core innovation lies in positioning itself as a "Tier 3 AI Coworker" capable of "end-to-end execution and delivery of results," moving beyond the "draft and wait for human completion" model of typical AI assistants. Users can simply mention Viktor in Slack or Microsoft Teams using natural language commands, and it autonomously performs tasks like pulling sales data from a CRM, generating reports, or even cross-tool operations like creating board meeting PPTs by aggregating data from six different sources. Key to its growth is a pure Product-Led Growth (PLG) model, eliminating complex implementation cycles and per-seat licensing. Instead, it charges based on task credits or consumption, lowering the trial barrier with a $100 free credit offer and no credit card required. This enabled viral, bottom-up adoption within organizations. Viktor's interaction paradigm removes the barrier of prompt engineering, allowing non-technical employees to delegate complex workflows seamlessly. It also features proactive, automated task execution (e.g., overnight bookkeeping, scheduled reports) based on triggers, effectively embedding AI as an automated "process layer" within business operations. However, its expansion into Microsoft Teams—a platform with 320 million users—highlights challenges. Large enterprises require stringent IT compliance, security reviews (e.g., SOC 2), and governance, potentially hindering the frictionless, user-driven adoption that succeeded in Slack. Additionally, the "black box" nature of its autonomous decision-making raises concerns about operational risks, data integrity, and the need for robust audit logs and permission controls. Balancing efficiency gains with security and trust remains a critical hurdle for Viktor and similar AI agents aiming to become core enterprise infrastructure.

marsbit2h ago

No Sales Team, $20 Million in Revenue: How Did AI Employee Viktor Win Over 30,000 Companies?

marsbit2h ago

Trading

Spot
Futures

Hot Articles

Discussions

Welcome to the HTX Community. Here, you can stay informed about the latest platform developments and gain access to professional market insights. Users' opinions on the price of AI (AI) are presented below.

活动图片