Behind the AI Report Card Lies a Chinese 'Exam Setter'

marsbit · Published 2026-06-20 · Updated 2026-06-20

Summary

Behind familiar benchmarks such as MMLU-Pro and MMMU, which major AI models strive to ace, stands a key "examiner": Chinese-Canadian researcher Wenhu Chen. An assistant professor at the University of Waterloo and founder of TIGERLab, Chen addresses the need for more rigorous AI evaluation. As frontier models began scoring near-perfect results on older benchmarks like MMLU, it became difficult to distinguish their true capabilities. In response, Chen introduced MMLU-Pro in 2024, featuring harder, more reasoning-focused questions with more answer choices, which reintroduced meaningful performance gaps. His work extends to multimodal evaluation with MMMU and its enhanced version, MMMU-Pro. These benchmarks test a model's ability to understand and reason with complex information from images, charts, and text across diverse academic subjects, exposing the significant challenges even top models face in genuine comprehension. Chen's background in complex QA and table reasoning, together with his experience at Google DeepMind on projects like Gemini, informs his approach. He understands that effective benchmarks must anticipate how models might "cheat" by memorizing data or avoiding visual analysis. His lab also actively researches video understanding and generation models (e.g., UniVideo, Vamba), which keeps his evaluation work grounded in practical model-building challenges. Now at Meta's Superintelligence Lab, Chen continues his focus on multi-modal data and evalua...

Each time a cutting-edge model is released, the AI community focuses on a few familiar report cards.

MMLU-Pro, MMMU, MMMU-Pro... These names might sound foreign to ordinary users, but for model companies and researchers, they have almost become the "standard subjects." GPT, Claude, Gemini, Llama, Qwen, and DeepSeek all keep submitting their answers on these benchmarks.

"The proof is in the pudding." How good a model is often needs to be proven by these scores.

Many performance comparison charts in model launch presentations rely on them; some leaderboards on HuggingFace are also built upon these evaluation systems. It could even be said that when the AI industry discusses model capabilities today, it is already using a common language defined by these benchmarks.

But interestingly, almost everyone focuses on the scores, yet few know who sets the questions. Behind MMLU-Pro, MMMU, and MMMU-Pro, the same name can be seen—Wenhu Chen.

He is an Assistant Professor in the Department of Computer Science at the University of Waterloo in Canada. On Google Scholar, his papers have been cited over 30,000 times.

He is also the founder of TIGERLab. The lab's full English name is Text and Image GEnerative Research Lab. Because the Chinese word for "tiger" is in his name, Wenhu Chen also gave it a very distinctive Chinese name: Hutou Bang (Tiger Head Gang).

01

After the Old Exam Papers Lost Their Effectiveness

Wenhu Chen first caught wider attention because of MMLU-Pro.

MMLU was once one of the most commonly used benchmark evaluations for assessing the capabilities of large language models. It was like a comprehensive test paper, covering multiple subjects, used to measure a model's performance in knowledge understanding and reasoning tasks.

Early on, this paper was very useful. It could distinguish between models through scores, and the industry could also use it to observe whether large language models were truly improving.

But problems soon emerged.

As model capabilities continuously improved, MMLU gradually became "insufficiently challenging." The scores of cutting-edge models got higher and higher, and the gaps between them became smaller and smaller.

After OpenAI released o3, this problem became even more apparent. The accuracy of o3 on MMLU was already close to 100%, and other cutting-edge models also successively submitted scores approaching full marks.

This might sound like good news, but for evaluation, it actually meant trouble.

If everyone can get close to full marks on an exam paper, it becomes very difficult to continue judging who is stronger and where their strengths lie. It can still prove that models possess certain capabilities, but it is no longer suitable for measuring new progress.

The AI industry needed a harder, less easily "fooled" exam paper.

In 2024, Wenhu Chen and his team launched MMLU-Pro.

MMLU-Pro revamped this exam paper rather than simply expanding the question bank.

It contains 12,032 questions covering 14 fields, including mathematics, physics, chemistry, law, engineering, psychology, and health. Compared to the original MMLU, it expands the answer options from 4 to 10, reducing the probability that a model guesses correctly. It also adds more reasoning-oriented questions and removes questions from the original bank that were relatively simple, ambiguous, or lacked sufficient discriminative power.
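To make the effect of going from 4 to 10 options concrete, here is a minimal sketch of the random-guess baseline. It assumes every question has exactly the stated number of options and exactly one correct answer; the function name `chance_accuracy` is hypothetical and not taken from the MMLU-Pro paper or codebase.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on one question.

    Assumes exactly one correct answer among `num_options` choices.
    """
    if num_options < 1:
        raise ValueError("a question needs at least one option")
    return Fraction(1, num_options)


if __name__ == "__main__":
    # Compare the guessing floor for a 4-option and a 10-option format.
    for label, k in [("4 options", 4), ("10 options", 10)]:
        print(f"{label}: chance accuracy = {float(chance_accuracy(k)):.0%}")
```

Under these assumptions the guessing floor falls from one in four to one in ten, so a score just above the floor says less about the model on a 4-option test than on a 10-option one.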

The effect was direct.

The paper's results showed that model accuracy on MMLU-Pro dropped by 16% to 33% compared to the original MMLU. When the same model was tested under 24 different prompt styles, the score variation also shrank, from 4-5% on the original MMLU to about 2% on MMLU-Pro.

In other words, this new paper is not only harder but also more stable.
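One way to picture "more stable" is to measure how far a single model's accuracy moves when only the prompt style changes. The sketch below is an illustration, not the paper's procedure: the function name `prompt_sensitivity` and the choice of range and standard deviation as summary statistics are assumptions, and the sample numbers are made-up placeholders rather than reported results.

```python
from statistics import pstdev


def prompt_sensitivity(accuracies):
    """Summarize how much one model's accuracy moves across prompt styles.

    `accuracies` holds one accuracy value (0.0 to 1.0) per prompt style.
    """
    scores = list(accuracies)
    if not scores:
        raise ValueError("need at least one accuracy value")
    return {
        "range": max(scores) - min(scores),  # best prompt minus worst prompt
        "stdev": pstdev(scores),             # population standard deviation
    }


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT measured results.
    toy_scores = [0.61, 0.63, 0.62, 0.60]
    print(prompt_sensitivity(toy_scores))
```

A benchmark on which this spread is small gives rankings that depend less on how the question happens to be phrased.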

It reopened the gaps between models that all seemed excellent on the old exam paper. It also became easier to tell whether a model truly understands reasoning or is just better at handling old-style questions.

02

Usable Benchmark Evaluations

MMLU-Pro was quickly adopted by the industry.

It was later accepted to the NeurIPS 2024 Datasets and Benchmarks track and was also integrated into EleutherAI's lm-evaluation-harness framework. For the open-source model community, this meant it was no longer just a dataset in a paper but had entered the common evaluation toolchain.

Many models began reporting MMLU-Pro scores upon release. Some leaderboards on HuggingFace also incorporated it into their evaluation systems.

If MMLU-Pro solved the problem of the "old exam paper losing effectiveness" in language model evaluation, then MMMU pushed Wenhu Chen and TIGERLab to the center of multimodal evaluation.

The problems with multimodal models are more complex.

Language models mainly process text when answering questions. Multimodal models, however, have to handle information in many different forms at once: images, charts, diagrams, maps, tables, musical scores, chemical structures, and so on. They need not only to understand the question stem but also to truly comprehend the content of the images, and to reason by integrating visual information, textual information, and domain knowledge.

The MMMU benchmark contains 11,500 multimodal questions sourced from university exams, quizzes, and textbooks, covering six major domains: Arts & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Technology & Engineering, further subdivided into 30 subjects and 183 subfields.

These questions are not simply asking the model "what's in the picture." They require the model to combine image information with domain knowledge, much like a student tackling a professional problem.

When MMMU was released, the research team tested 14 open-source multimodal models, as well as representative closed-source models like GPT-4V and Gemini Ultra. Even the strongest closed-source models at the time, GPT-4V and Gemini Ultra, only achieved accuracy rates of 56% and 59% respectively.

These numbers indicate that while multimodal models appear to be progressing rapidly, they still have significant room for improvement when it comes to problems requiring genuine professional understanding and reasoning.

Later, Wenhu Chen's team released MMMU-Pro, further plugging the gaps that allowed models to bypass visual information. It filters out questions that could be answered by text-only models, expands answer choices, and introduces a vision-only setting where questions are embedded within images, requiring the model to perform both visual reading and text comprehension simultaneously.

Simply put, it prevents the model from "guessing the answer just by looking at the text."
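The filtering idea can be sketched as follows. This is a hypothetical illustration, not the MMMU-Pro pipeline: the `Question` record, the `text_only_correct` flags, and the `max_text_only_solvers` threshold are all assumed names and rules introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    qid: str
    # One flag per text-only model: True if it answered correctly
    # WITHOUT being shown the image (hypothetical pre-computed results).
    text_only_correct: tuple


def needs_image(q: Question, max_text_only_solvers: int = 0) -> bool:
    """Keep a question only if few enough text-only models solved it."""
    return sum(q.text_only_correct) <= max_text_only_solvers


def filter_text_answerable(questions, max_text_only_solvers: int = 0):
    """Drop questions that text-only models can already answer."""
    return [q for q in questions if needs_image(q, max_text_only_solvers)]


if __name__ == "__main__":
    pool = [
        Question("q1", (True, True, False)),    # solvable from text alone
        Question("q2", (False, False, False)),  # image appears necessary
    ]
    print([q.qid for q in filter_text_answerable(pool)])
```

The threshold is a design choice: a stricter value removes more questions but leaves a set in which a correct answer is stronger evidence that the model actually read the image.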

This kind of work might sound somewhat tedious, but it is crucial, because future multimodal models need to enter scenarios like healthcare, education, scientific research, design, and engineering. Merely being able to describe a picture is not enough. They must be able to judge, reason, explain, and find the truly useful parts within complex visual information.

03

The People Behind the "Exam Papers"

Wenhu Chen's later work on MMLU-Pro and MMMU stems from his long-standing research direction.

His research interests have always been related to complex information understanding, knowledge question answering, and reasoning.

He earned his bachelor's degree from Huazhong University of Science and Technology, then pursued a master's at RWTH Aachen University in Germany, followed by a Ph.D. in Computer Science from the University of California, Santa Barbara. During his Ph.D., he had already begun research in areas like complex question answering, table reasoning, and knowledge evidence localization.

These tasks share a common characteristic: the answer often does not lie within a single piece of text.

It might be hidden in a table, require combining a piece of text and an image, or might need the model to first retrieve information, then integrate, calculate, and reason. The model cannot just be good at reciting existing knowledge.

Projects Wenhu Chen participated in, such as HybridQA, TabFact, Program of Thoughts, and MAmmoTH, are all related to this line of work.

This also explains his sensitivity to loopholes in model evaluation.

A good benchmark evaluation is not simply about making questions increasingly difficult, but about anticipating where models are most likely to "guess correctly" or "appear competent."

A model might memorize the question bank, guess answers based on options, or use text to bypass visual information... Good evaluation needs to patch these loopholes well.

After his Ph.D., Wenhu Chen joined Google Research and later participated in the development and evaluation of Google DeepMind's Gemini multimodal model from 2021 to 2025. This experience was also important. Long-term exposure to cutting-edge model development gave him a clearer understanding of how model capabilities grow and made it easier to see potential biases and blind spots in evaluation.

In the fall of 2022, Wenhu Chen joined the David R. Cheriton School of Computer Science at the University of Waterloo as an Assistant Professor. The same year, he was selected as a Canada CIFAR AI Chair. Subsequently, he founded "TIGERLab" (aka Hutou Bang), continuing research focused on foundation models, multimodal capabilities, and benchmark evaluations.

Hutou Bang doesn't just work on benchmark evaluations; they also conduct model and system research.

In the video direction, UniVideo attempts to place video understanding, generation, and editing within the same framework, allowing the model not only to generate a sequence of frames but also to understand content, respond to instructions, and complete edits. Vamba targets long video understanding, addressing the memory, computation, and training efficiency challenges posed by hour-long videos. MoCha, a collaboration with Meta's Generative AI team, focuses on talking virtual character generation, producing high-quality character videos from voice and text descriptions.

An exam setter who never takes tests cannot set good questions. Building models, in turn, makes the lab better suited to evaluating them.

Truly good evaluation often comes from understanding the boundaries of model capability. Only by knowing how models are built and what problems they encounter in real tasks can one design questions that differentiate performance and expose weaknesses.

Now, Wenhu Chen has joined Meta's Superintelligence Lab, where his work continues to focus on multimodal pretraining data and evaluation, serving Meta's foundation models.

The AI industry does not lack visible figures. Typically, the spotlight falls on entrepreneurs, star researchers, and heads of large model companies. New product launches, funding news, open-source models, and team adjustments often attract the most external attention, making these names more visible to the public.

But today, the participation of Chinese talent in the AI field extends far beyond these most conspicuous positions.

This article is from the WeChat public account "Letters AI", author: Jin Ya

Related Questions

Q: Who is the key person behind AI benchmark evaluations like MMLU-Pro and MMMU, and what is his background?

A: The key person behind AI benchmarks such as MMLU-Pro and MMMU is Wenhu Chen, an assistant professor in the Computer Science department at the University of Waterloo. He previously worked at Google Research and Google DeepMind on projects like the Gemini multimodal model. He is also the founder of TIGERLab (also known as Hutou Bang, the "Tiger Head Gang"), which focuses on generative AI research for text and images.

Q: Why was MMLU-Pro created, and how does it differ from the original MMLU benchmark?

A: MMLU-Pro was created because the original MMLU benchmark became less effective as advanced AI models started achieving near-perfect scores, making it difficult to differentiate their capabilities. MMLU-Pro differs by expanding the number of answer choices from 4 to 10, which reduces guesswork, and by incorporating more reasoning-focused questions. It also removes simpler or ambiguous questions, resulting in a more challenging and stable evaluation that better distinguishes model performance.

Q: What is the MMLU benchmark designed to evaluate, and what challenges did it face over time?

A: The MMLU (Massive Multitask Language Understanding) benchmark is designed to evaluate large language models' knowledge comprehension and reasoning abilities across multiple academic subjects. Over time, as models like OpenAI's o3 achieved near-100% accuracy, MMLU became less effective at distinguishing between top-performing models, leading to the need for a more advanced benchmark like MMLU-Pro.

Q: What is the MMMU benchmark, and how does it assess multimodal AI models?

A: The MMMU (Massive Multi-discipline Multimodal Understanding) benchmark is designed to assess multimodal AI models by testing their ability to integrate and reason with information from both text and visual inputs (e.g., images, charts, diagrams). It includes 11,500 questions from university exams and textbooks across six major fields, requiring models to combine visual understanding with domain knowledge to solve complex problems.

Q: How does Wenhu Chen's work on AI benchmarks relate to his broader research interests and projects?

A: Wenhu Chen's work on AI benchmarks is closely tied to his broader research focus on complex information understanding, knowledge-based reasoning, and multimodal AI. His involvement in projects like HybridQA, TabFact, and UniVideo reflects his interest in tasks requiring integration of diverse data sources. By developing models himself, he gains insights into their limitations, enabling him to design more effective benchmarks that accurately assess true model capabilities.
