Behind the AI Report Card Lies a Chinese 'Exam Setter'

marsbit | Published on 2026-06-20 | Last updated on 2026-06-20

Abstract

Beyond the familiar performance charts like MMLU-Pro and MMMU, which major AI models strive to ace, stands a key "examiner": Chinese-Canadian researcher Wenhu Chen. An assistant professor at the University of Waterloo and founder of TIGERLab, Chen addresses the crucial need for more rigorous AI evaluation. As models like GPT-4 began scoring near-perfect results on older benchmarks like MMLU, it became difficult to distinguish their true capabilities. In response, Chen introduced MMLU-Pro in 2024, featuring harder, more reasoning-focused questions with more answer choices, successfully reintroducing meaningful performance gaps. His work extends to multi-modal evaluation with MMMU and its enhanced version, MMMU-Pro. These benchmarks test a model's ability to understand and reason with complex information from images, charts, and text across diverse academic subjects, exposing the significant challenges even top models face in genuine comprehension. Chen's background in complex QA, table reasoning, and his experience at Google DeepMind on projects like Gemini inform his approach. He understands that effective benchmarks must anticipate how models might "cheat" by memorizing data or avoiding visual analysis. His lab also actively researches video understanding and generation models (e.g., UniVideo, Vamba), ensuring his evaluation work is grounded in practical model-building challenges. Now at Meta's Super Intelligence Lab, Chen continues his focus on multi-modal data and evalua...

Each time a cutting-edge model is released, the AI community focuses on a few familiar report cards.

MMLU-Pro, MMMU, MMMU-Pro... These names might sound foreign to ordinary users, but for model companies and researchers, they have almost become the "standard subjects." GPT, Claude, Gemini, Llama, Qwen, DeepSeek continuously submit their answers on these benchmarks.

"The proof is in the pudding." How good a model is often needs to be proven by these scores.

Many performance comparison charts in model launch presentations rely on them, and some leaderboards on HuggingFace are also built on these evaluation systems. It could even be said that when the AI industry discusses model capabilities today, it is already using a common language defined by these benchmarks.

But interestingly, almost everyone focuses on the scores, yet few know who sets the questions. Behind MMLU-Pro, MMMU, and MMMU-Pro, the same name can be seen—Wenhu Chen.

He is an Assistant Professor in the Department of Computer Science at the University of Waterloo in Canada. On Google Scholar, his papers have been cited over 30,000 times.

He is also the founder of TIGERLab, whose full English name is the Text and Image GEnerative Research Lab. Because his name contains the Chinese character for "tiger," Wenhu Chen gave the lab a distinctive Chinese name: Hutou Bang (Tiger Head Gang).

01

After the Old Exam Papers Lost Their Effectiveness

Wenhu Chen first caught wider attention because of MMLU-Pro.

MMLU was once one of the most commonly used benchmark evaluations for assessing the capabilities of large language models. It was like a comprehensive test paper, covering multiple subjects, used to measure a model's performance in knowledge understanding and reasoning tasks.

Early on, this paper was very useful. It could distinguish between models through scores, and the industry could also use it to observe whether large language models were truly improving.

But problems soon emerged.

As model capabilities continuously improved, MMLU gradually became "insufficiently challenging." The scores of cutting-edge models got higher and higher, and the gaps between them became smaller and smaller.

After OpenAI released o3, this problem became even more apparent. The accuracy of o3 on MMLU was already close to 100%, and other cutting-edge models posted scores approaching full marks one after another.

This might sound like good news, but for evaluation, it actually meant trouble.

If everyone can get close to full marks on an exam paper, it becomes very difficult to continue judging who is stronger and where their strengths lie. It can still prove that models possess certain capabilities, but it is no longer suitable for measuring new progress.

The AI industry needed a harder, less easily "fooled" exam paper.

In 2024, Wenhu Chen and his team launched MMLU-Pro.

MMLU-Pro revamped this exam paper rather than simply expanding the question bank.

It contains 12,032 questions covering 14 fields, including mathematics, physics, chemistry, law, engineering, psychology, and health. Compared to the original MMLU, it expands the options from 4 to 10, reducing the probability that a model guesses correctly. It also adds more reasoning-oriented questions and removes questions from the original bank that were relatively simple, ambiguous, or lacked sufficient discriminative power.
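The effect of the larger option count on blind guessing is simple arithmetic. The following is a minimal sketch, assuming every question has exactly one correct answer and the full number of options; the function name `random_guess_accuracy` is illustrative and not taken from the MMLU-Pro codebase.

```python
from fractions import Fraction


def random_guess_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a
    single-answer multiple-choice question (illustrative helper)."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Assumption: every question offers the full set of options.
mmlu_baseline = random_guess_accuracy(4)       # original MMLU: 4 options
mmlu_pro_baseline = random_guess_accuracy(10)  # MMLU-Pro: 10 options

print(f"4 options:  {float(mmlu_baseline):.0%}")      # 25%
print(f"10 options: {float(mmlu_pro_baseline):.0%}")  # 10%
```

Under these assumptions, a model that guesses blindly drops from a 25% floor to a 10% floor, so a given score reflects less luck.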

The effect was direct.

The paper's results showed that model accuracy on MMLU-Pro dropped by 16% to 33% compared with the original MMLU. When the same model was tested under 24 different prompt styles, the variation in its score also shrank, from 4% to 5% on the original MMLU to about 2% on MMLU-Pro.

In other words, this new paper is not only harder but also more stable.
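One way to quantify that stability is to score the same model under several prompt styles and look at the spread. The sketch below assumes that "score variation" means the gap between the best and worst prompt; the paper may use a different statistic. The accuracies are made-up placeholders, not reported results, and `prompt_spread` is a hypothetical helper.

```python
from statistics import pstdev


def prompt_spread(accuracy_by_prompt: dict[str, float]) -> dict[str, float]:
    """Summarize how much one model's accuracy moves across prompt styles."""
    if not accuracy_by_prompt:
        raise ValueError("need at least one prompt style")
    scores = list(accuracy_by_prompt.values())
    return {
        "range": max(scores) - min(scores),  # best prompt minus worst prompt
        "stdev": pstdev(scores),             # population standard deviation
    }


# Hypothetical accuracies for one model under three prompt styles.
placeholder_scores = {"style_a": 0.71, "style_b": 0.73, "style_c": 0.72}
spread = prompt_spread(placeholder_scores)
print(f"range = {spread['range']:.3f}, stdev = {spread['stdev']:.3f}")
```

A benchmark on which this spread is small gives rankings that depend less on how the question happens to be phrased.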

It reopened the gaps between models that all seemed excellent on the old exam paper. It also became easier to tell whether a model truly understands reasoning or is just better at handling old-style questions.

02

Usable Benchmark Evaluations

MMLU-Pro was quickly adopted by the industry.

MMLU-Pro later entered the NeurIPS 2024 Datasets and Benchmarks track and was also integrated into EleutherAI's lm-evaluation-harness framework. For the open-source model community, this meant it was no longer just a dataset in a paper but had entered the common evaluation toolchain.

Many models began reporting MMLU-Pro scores upon release. Some leaderboards on HuggingFace also incorporated it into their evaluation systems.

If MMLU-Pro solved the problem of the "old exam paper losing effectiveness" in language model evaluation, then MMMU pushed Wenhu Chen and TIGERLab to the center of multimodal evaluation.

The problems with multimodal models are more complex.

When language models answer questions, they mainly process text. Multimodal models, however, have to simultaneously process information in different forms such as images, charts, diagrams, maps, tables, musical scores, and chemical structures. They not only need to understand the question stem but also truly comprehend the content in the images, and reason by integrating visual information, textual information, and domain knowledge.

The MMMU benchmark contains 11,500 multimodal questions sourced from university exams, quizzes, and textbooks, covering six major domains: Arts & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Technology & Engineering, further subdivided into 30 subjects and 183 subfields.

These questions are not simply asking the model "what's in the picture." They require the model to combine image information with domain knowledge, much like a student tackling a professional problem.

When MMMU was released, the research team tested 14 open-source multimodal models, as well as representative closed-source models like GPT-4V and Gemini Ultra. Even the strongest closed-source models at the time, GPT-4V and Gemini Ultra, only achieved accuracy rates of 56% and 59% respectively.

These numbers indicate that while multimodal models appear to be progressing rapidly, they still have significant room for improvement when it comes to problems requiring genuine professional understanding and reasoning.

Later, Wenhu Chen's team released MMMU-Pro, further plugging the gaps that allowed models to bypass visual information. It filters out questions that could be answered by text-only models, expands answer choices, and introduces a vision-only setting where questions are embedded within images, requiring the model to perform both visual reading and text comprehension simultaneously.

Simply put, it prevents the model from "guessing the answer just by looking at the text."
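The text-only filtering step can be sketched roughly as follows. This is an illustration under stated assumptions, not the MMMU-Pro pipeline itself: the `Question` type, the `filter_text_answerable` function, the model names, and the `max_correct_models` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    qid: str
    answer: str  # gold option label, e.g. "A"


def filter_text_answerable(
    questions: list[Question],
    text_only_predictions: dict[str, dict[str, str]],
    max_correct_models: int = 1,
) -> list[Question]:
    """Keep questions that at most `max_correct_models` text-only models
    answer correctly without seeing the image (hypothetical threshold)."""
    kept = []
    for question in questions:
        correct = sum(
            1
            for predictions in text_only_predictions.values()
            if predictions.get(question.qid) == question.answer
        )
        if correct <= max_correct_models:
            kept.append(question)
    return kept


# Toy data: three questions, three text-only models (all made up).
questions = [Question("q1", "A"), Question("q2", "B"), Question("q3", "C")]
predictions = {
    "llm_1": {"q1": "A", "q2": "D", "q3": "C"},
    "llm_2": {"q1": "A", "q2": "B", "q3": "D"},
    "llm_3": {"q1": "A", "q2": "C", "q3": "C"},
}
kept = filter_text_answerable(questions, predictions)
print([q.qid for q in kept])  # only q2 survives the filter
```

In the toy data, q1 and q3 are dropped because two or more text-only models answer them without the image, which suggests the picture is not really needed.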

This kind of work might sound somewhat tedious, but it is crucial, because future multimodal models will need to operate in scenarios like healthcare, education, scientific research, design, and engineering, where merely being able to describe a picture is not enough. They must be able to judge, reason, explain, and find the truly useful parts within complex visual information.

03

The People Behind the "Exam Papers"

Wenhu Chen's later work on MMLU-Pro and MMMU stems from his long-standing research direction.

His research interests have always been related to complex information understanding, knowledge question answering, and reasoning.

He earned his bachelor's degree from Huazhong University of Science and Technology, then pursued a master's at RWTH Aachen University in Germany, followed by a Ph.D. in Computer Science from the University of California, Santa Barbara. During his Ph.D., he had already begun research in areas like complex question answering, table reasoning, and knowledge evidence localization.

These tasks share a common characteristic: the answer often does not lie within a single piece of text.

It might be hidden in a table, require combining a piece of text and an image, or might need the model to first retrieve information, then integrate, calculate, and reason. The model cannot just be good at reciting existing knowledge.

Projects Wenhu Chen participated in, such as HybridQA, TabFact, Program of Thoughts, and MAmmoTH, are all related to this line of work.

This also explains his sensitivity to loopholes in model evaluation.

A good benchmark evaluation is not simply about making questions increasingly difficult, but about anticipating where models are most likely to "guess correctly" or "appear competent."

A model might memorize the question bank, guess answers based on options, or use text to bypass visual information... Good evaluation needs to patch these loopholes well.

After his Ph.D., Wenhu Chen joined Google Research and later participated in the development and evaluation of Google DeepMind's Gemini multimodal model from 2021 to 2025. This experience was also important. Long-term exposure to cutting-edge model development gave him a clearer understanding of how model capabilities grow and made it easier to see potential biases and blind spots in evaluation.

In the fall of 2022, Wenhu Chen joined the David R. Cheriton School of Computer Science at the University of Waterloo as an Assistant Professor. The same year, he was selected as a Canada CIFAR AI Chair. Subsequently, he founded "TIGERLab" (aka Hutou Bang), continuing research focused on foundation models, multimodal capabilities, and benchmark evaluations.

Hutou Bang doesn't just work on benchmark evaluations; they also conduct model and system research.

In the video direction, UniVideo attempts to place video understanding, generation, and editing within the same framework, allowing the model not only to generate a sequence of frames but also to understand content, respond to instructions, and complete edits. Vamba targets long video understanding, addressing the memory, computation, and training efficiency challenges posed by hour-long videos. MoCha, a collaboration with Meta's Generative AI team, focuses on talking virtual character generation, producing high-quality character videos from voice and text descriptions.

An exam setter who has never sat an exam cannot set good questions. Conversely, building models themselves makes the team better suited to evaluating them.

Truly good evaluation often comes from understanding where a model's capabilities end. Only by knowing how models are built and what problems they encounter in real tasks can one design questions that differentiate performance and expose weaknesses.

Now, Wenhu Chen has joined Meta's Super Intelligence Lab, where his work continues to focus on multimodal pretraining data and evaluation, serving Meta's foundation models.

The AI industry does not lack visible figures. Typically, the spotlight falls on entrepreneurs, star researchers, and heads of large model companies. New product launches, funding news, open-source models, and team adjustments often attract the most external attention, making these names more visible to the public.

But today, the participation of Chinese talent in the AI field extends far beyond these most conspicuous positions.

This article is from the WeChat public account "Letters AI", author: Jin Ya

Related Questions

Q: Who is the key person behind AI benchmark evaluations like MMLU-Pro and MMMU, and what is his background?

A: The key person behind AI benchmarks such as MMLU-Pro and MMMU is Wenhu Chen, an assistant professor in the Computer Science department at the University of Waterloo. He previously worked at Google Research and Google DeepMind on projects like the Gemini multimodal model. He is also the founder of TIGERLab (also known as the 'Tiger Head Gang'), which focuses on generative AI research for text and images.

Q: Why was MMLU-Pro created, and how does it differ from the original MMLU benchmark?

A: MMLU-Pro was created because the original MMLU benchmark became less effective as advanced AI models started achieving near-perfect scores, making it difficult to differentiate their capabilities. MMLU-Pro differs by expanding the number of answer choices from 4 to 10, reducing guesswork, and incorporating more reasoning-focused questions. It also removes simpler or ambiguous questions, resulting in a more challenging and stable evaluation that better distinguishes model performance.

Q: What is the MMLU benchmark designed to evaluate, and what challenges did it face over time?

A: The MMLU (Massive Multitask Language Understanding) benchmark is designed to evaluate large language models' knowledge comprehension and reasoning abilities across multiple academic subjects. Over time, as models like OpenAI's o3 achieved near-100% accuracy, MMLU became less effective at distinguishing between top-performing models, leading to the need for a more advanced benchmark like MMLU-Pro.

Q: What is the MMMU benchmark, and how does it assess multimodal AI models?

A: The MMMU (Massive Multi-discipline Multimodal Understanding) benchmark is designed to assess multimodal AI models by testing their ability to integrate and reason with information from both text and visual inputs (e.g., images, charts, diagrams). It includes 11,500 questions from university exams, quizzes, and textbooks across six major fields, requiring models to combine visual understanding with domain knowledge to solve complex problems.

Q: How does Wenhu Chen's work on AI benchmarks relate to his broader research interests and projects?

A: Wenhu Chen's work on AI benchmarks is closely tied to his broader research focus on complex information understanding, knowledge-based reasoning, and multimodal AI. His involvement in projects like HybridQA, TabFact, and UniVideo reflects his interest in tasks requiring integration of diverse data sources. By developing models himself, he gains insights into their limitations, enabling him to design more effective benchmarks that accurately assess true model capabilities.
