Just 6 Days After Launching ChatGPT Health, OpenAI Is Surpassed on Its Own Medical Benchmark

marsbit | Published on 2026-01-14 | Last updated on 2026-01-14

Abstract

In a significant development in the AI healthcare sector, Baichuan Intelligence has surpassed OpenAI's GPT-5.2 High on the HealthBench benchmark—a medical evaluation dataset created by OpenAI with input from 260+ doctors across 60 countries—just six days after OpenAI launched ChatGPT Health. Baichuan's new model, Baichuan-M3, achieved a top score of 65.1 and also led in the more challenging HealthBench Hard subset, while demonstrating the lowest hallucination rate (3.5%) without relying on external tools. Key to M3’s performance is its Fact Aware RL technique, which improves diagnostic accuracy by balancing factual precision with proactive questioning. The model avoids both over-confident errors and overly vague responses. Additionally, Baichuan introduced SCAN-bench, a new evaluation framework designed to simulate real doctor-patient interactions. In tests, M3 outperformed human specialists in areas like safety stratification, clarity, and diagnostic questioning, partly due to its ability to integrate knowledge across medical disciplines. Baichuan is now rolling out the model via its consumer product Baixiaoying (百小应), offering tailored interfaces for both doctors and patients. The company emphasizes a focus on "serious medicine," prioritizing complex areas like oncology over general wellness, aiming to augment—not just assist—medical professionals. According to CEO Wang Xiaochuan, enhancing AI’s capability in high-stakes medical scenarios is crucial for building user trus...

Author: Li Yuan

Have you ever asked an AI assistant about your health problems?

If you are a heavy user of AI like me, you probably have.

According to OpenAI's own data, health has become one of the most common use cases for ChatGPT, with over 230 million people worldwide asking health and wellness-related questions every week.

Because of this, as we move into 2026, the health field is showing strong signs of becoming a battleground in the AI sector.

On January 7th, OpenAI launched ChatGPT Health, allowing users to connect electronic medical records and various health apps to get more targeted medical responses; and on January 12th, Anthropic immediately launched Claude for Healthcare, emphasizing the new model's capabilities in medical scenarios.

Interestingly, this time, a Chinese company is not lagging behind, and even seems to be taking the lead.

On January 13th, Baichuan Intelligence announced the release of the Baichuan M3 model, which surpassed OpenAI's GPT-5.2 High on HealthBench, a medical and health evaluation test set released by OpenAI, achieving SOTA.

After facing much skepticism for announcing an All-in strategy on healthcare, Baichuan Intelligence seems to have finally proven itself. Geek Park specifically spoke with Wang Xiaochuan to discuss how Baichuan Intelligence views the capabilities of this M3 model and the endgame of AI in healthcare.

01 First to Surpass OpenAI on a Health Domain Test Set

One of the most eye-catching achievements of the newly released M3 model is that it surpassed OpenAI's GPT-5.2 High for the first time on HealthBench, a medical and health evaluation test set released by OpenAI, achieving SOTA.

SOTA on HealthBench, HealthBench Hard, and Hallucination Evaluation

HealthBench is a medical and health evaluation test set released by OpenAI in May 2025. It was built by 262 doctors from 60 countries and contains 5,000 sets of highly realistic multi-turn medical dialogues. It is one of the most authoritative and clinically realistic medical evaluation sets globally.

Since its release, OpenAI's models have dominated the rankings.

This time, however, Baichuan Intelligence's new generation open-source medical large model, Baichuan-M3, achieved a comprehensive score of 65.1, ranking first globally. It even topped the charts on HealthBench Hard, which specifically tests complex decision-making abilities, setting a new high score.

Baichuan also simultaneously released a hallucination rate test result. The M3 model achieved a hallucination rate of 3.5%, among the lowest globally.

It is worth noting that this hallucination rate is measured in a pure model setting without relying on external retrieval tools.

Baichuan Intelligence stated that the key model improvement enabling these two points is the introduction of a reinforcement learning algorithm suitable for healthcare.

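Baichuan applied Fact Aware RL (fact-aware reinforcement learning) to the M3 model for the first time, so that the model neither falls back on boilerplate answers nor makes things up.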

This is actually very critical in the medical field.

When asking medical questions to an unoptimized model, the two most common problems are these: one is the model directly fabricating your symptoms and speculating about a disease; the other is vague, noncommittal wording that ultimately just suggests you see a doctor, which isn't very helpful for either doctors or patients.

Part of the cause is that many models use the raw hallucination rate as the optimization target. A model can then dilute its overall hallucination rate by piling up simple, correct facts. Baichuan introduced semantic clustering and importance weighting mechanisms: clustering eliminates interference from redundant expressions, and weighting ensures core medical assertions receive higher weight.
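To make the dilution problem concrete, here is a minimal sketch, not Baichuan's actual implementation. It assumes each claim in an answer already carries a semantic cluster id, an importance weight, and a correctness label; the field names `cluster`, `importance`, and `correct`, both function names, and the rule that a cluster counts once at its highest importance are all hypothetical choices for illustration.

```python
from collections import defaultdict


def plain_hallucination_rate(claims):
    """Share of wrong claims when every claim counts equally."""
    return sum(not c["correct"] for c in claims) / len(claims)


def weighted_hallucination_rate(claims):
    """Hallucination rate after cluster de-duplication and importance weighting."""
    clusters = defaultdict(list)
    for claim in claims:
        clusters[claim["cluster"]].append(claim)

    total = 0.0
    wrong = 0.0
    for members in clusters.values():
        # Assumption: paraphrases in one cluster count once,
        # at the weight of their most important member.
        weight = max(m["importance"] for m in members)
        total += weight
        # Assumption: a cluster is hallucinated if any member is wrong.
        if not all(m["correct"] for m in members):
            wrong += weight
    return wrong / total


# Hypothetical answer: four easy, repetitive facts and one wrong core assertion.
claims = [
    {"cluster": "rest", "importance": 1.0, "correct": True},
    {"cluster": "rest", "importance": 1.0, "correct": True},
    {"cluster": "fluids", "importance": 1.0, "correct": True},
    {"cluster": "fluids", "importance": 1.0, "correct": True},
    {"cluster": "diagnosis", "importance": 5.0, "correct": False},
]
# The same answer padded with one more paraphrase of an easy fact.
padded = claims + [{"cluster": "rest", "importance": 1.0, "correct": True}]

print(plain_hallucination_rate(claims), plain_hallucination_rate(padded))
print(weighted_hallucination_rate(claims), weighted_hallucination_rate(padded))
```

In this toy data, padding lowers the plain rate from 1/5 to 1/6, while the clustered, weighted rate stays at 5/7 because the extra paraphrase adds no new cluster and the wrong diagnostic claim keeps its weight.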

At the same time, simply introducing a heavily weighted hallucination penalty can very easily force the model into a conservative "say less, make fewer mistakes" strategy. Therefore, the Fact Aware RL algorithm also includes a dynamic weight adjustment mechanism that adaptively balances these two goals based on the model's current capability level: it focuses on medical knowledge learning and expression (high Task Weight) during the capability-building phase, and gradually tightens factual constraints (increasing Hallucination Weight) after capabilities mature.
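The dynamic weighting idea can be sketched as a simple schedule. This is an illustrative assumption, not the published algorithm: the article does not say how capability is measured or how the weights move, so the linear interpolation, the `low` and `high` bounds, and the function names `reward_weights` and `combined_reward` are all hypothetical.

```python
def reward_weights(capability, low=0.2, high=0.8):
    """Return (task_weight, hallucination_weight) for a capability score in [0, 1].

    Assumption: the hallucination weight rises linearly from `low` to `high`
    as capability grows, and the task weight is its complement.
    """
    capability = min(max(capability, 0.0), 1.0)  # clamp to [0, 1]
    hallucination_weight = low + (high - low) * capability
    task_weight = 1.0 - hallucination_weight
    return task_weight, hallucination_weight


def combined_reward(task_score, hallucination_rate, capability):
    """Task reward minus hallucination penalty, mixed by the current weights."""
    task_weight, hallucination_weight = reward_weights(capability)
    return task_weight * task_score - hallucination_weight * hallucination_rate


# The same answer (decent task score, notable hallucination rate)
# scored early and late in training.
early = combined_reward(task_score=0.7, hallucination_rate=0.3, capability=0.0)
late = combined_reward(task_score=0.7, hallucination_rate=0.3, capability=1.0)
print(early, late)
```

Under these assumed settings, an identical answer is rewarded while the model is still building capability and penalized once it is mature, which is the qualitative behavior the article describes: the model is not pushed toward saying less before it has learned to say anything useful.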

When internet search is available, Baichuan also added an online verification module based on multi-turn search and introduced an efficient caching system for aligning massive medical knowledge.

02 Consultation Level Surpasses Human Doctors, Entering the Usable Stage

However, surpassing OpenAI on HealthBench is not the only highlight this time.

A more interesting point is that Baichuan creatively built its own SCAN-bench evaluation set. Compared to just topping OpenAI's leaderboard, the evaluation set built by Baichuan itself might better illustrate the direction Baichuan Intelligence wants to optimize for in healthcare.

The key point of this evaluation set built by Baichuan is to optimize "end-to-end consultation capability." This stems from an insight from Baichuan's own experiments: for every 2% increase in consultation accuracy, diagnosis accuracy increases by 1%.

That is, compared to OpenAI's HealthBench, which still mainly focuses on "whether the AI can answer questions," Baichuan's SCAN-bench hopes to evaluate: can the AI, in a Q&A process, obtain effective information and simultaneously provide correct diagnosis results and medical advice.

Usually, when we ask an AI assistant a question, simply telling it "you are an experienced doctor" does not yield very good model performance. That is because a real doctor's consultation process is highly standardized. Baichuan summarizes it as the four quadrants of the SCAN principle: Safety Stratification, Clarity Matters, Association & Inquiry, and Normative Protocol.

Centered on the SCAN principle, Baichuan drew on the OSCE method long used in medical education and collaborated with over 150 frontline doctors to build the SCAN-bench evaluation system. It breaks the diagnosis process into three stages (medical history collection, auxiliary examination, and precise diagnosis) and assesses them through dynamic, multi-turn methods that simulate the doctor's full process from consultation to diagnosis. Baichuan also optimizes the model toward better results in these stages.

This time, Baichuan also announced the M3 model's evaluation results on SCAN-bench.

The results are very interesting. Baichuan not only compared with other models this time but also brought in real doctors for comparison. And in the four quadrants, the real doctors have actually fallen behind the level that the model can achieve.

Geek Park specifically asked the Baichuan team about this and received the answer: this evaluation involved real specialist doctors comparing with the model on specialist cases. The model won, firstly, because the model is more patient, but more importantly, the model has better mastery of interdisciplinary knowledge.

For example, one case involved a 10-year-old child with recurrent fever. Fever is a symptom that can involve many body systems: if the doctor asks only about cough and other lung conditions, it is easy to overlook serious problems in the joints or urinary system and misdiagnose the case as a common infection.

Human doctors are usually strongest within their own specialty, which is why complex symptoms often require consultations across specialties, and why even experts in difficult and complicated diseases often need to consult books and look up information.

Ordinary models that haven't been specifically trained and are merely role-playing as doctors often struggle to answer such questions well.

03 Next Step: Gradually Start Making C-end Products, Promoting More Serious Healthcare

For Baichuan Intelligence, surpassing human doctors is a very significant milestone: it means AI is starting to cross the usability threshold and can begin to be deployed in usage scenarios.

Starting January 13th, users can already experience the answers provided by the M3 model on the Baixiaoying website and app.

The current website design is very interesting. It offers a doctor version and a user version, both answered by the M3 model. In the doctor version, answers are more concise, cite more references, and use more technical language rather than layman's terms. In the ordinary patient version, the model almost never gives an answer all at once; it asks more follow-up questions and provides a clearer diagnosis.

Baichuan Intelligence mentioned that the model's backend thinking is very interesting. "We often see the model mention in its chain of thought, 'This patient didn't answer my question, but I must ask this question.' We've even seen extreme cases where it says, 'I've already asked the patient 20 rounds, which has exceeded the set maximum number of rounds, but I still have to ask this question.' This is because during training, the model doesn't get rewarded for being slick with its words; it only gets rewarded if it truly obtains enough key information and makes the correct diagnosis. This is a clear difference between how we train our models and how others do."

Many AI companies have recently started moving into the medical field. This is also where Baichuan Intelligence sees its biggest difference: it wants to do more serious healthcare.

"This means that when Baichuan chooses a scenario, it's not about which scenario is easiest to do. On the contrary, Baichuan insists on continuously pushing technological capabilities and challenging more difficult problems," Wang Xiaochuan said.

A typical example is that Baichuan will prioritize solving scenarios in oncology, while psychological healing is lower on Baichuan's priority list.

Popular opinion generally holds that AI-provided psychological healing is a simpler scenario that is easier to implement. Baichuan's reasoning is different. They believe the oncology field has a stricter scientific basis. Here, AI is more likely to produce serious medical effects and thereby reach or even surpass the level of human doctors. In contrast, the field of psychology lacks this kind of deterministic scientific anchor point.

Another example is that some companies choose to create avatars for doctors. Wang Xiaochuan believes this direction is not what Baichuan wants to do. A doctor's avatar itself cannot fully reuse the doctor's level of ability, let alone surpass it. Such AI can only end up being a front and a customer acquisition tool, not truly promoting serious healthcare.

This insistence on seriousness deeply affects many of Baichuan's business choices.

This is directly related to Wang Xiaochuan's thinking on the fundamental issues of the next stage of medical AI. He believes that the most important task at the current stage is to gradually provide more medical supply based on enhancing AI capabilities.

China has been trying to implement a hierarchical diagnosis and treatment system and a general practitioner system for many years. The original intention was for ordinary people to first seek medical care at the grassroots level, solving the problems of difficult appointments, long queues, and severe congestion in large hospitals.

The reason this system has been difficult to implement is essentially the insufficient supply of medical resources. Grassroots medical institutions lack high-level doctors. People are willing to queue at tertiary hospitals even for a cold because they distrust the diagnostic level at the grassroots level.

This is the key point where medical AI can play a role. Large models can achieve large-scale distribution of top-tier medical knowledge. They fill the supply gap at the grassroots level, allowing every community, every family to possess diagnostic capabilities like experts from tertiary hospitals.

In the long run, this can have a broader impact, potentially shifting the decision-making power in healthcare from doctors to users. In traditional medical scenarios, patients are the beneficiaries but often lack decision-making power. Decision-making power is concentrated in the hands of doctors. This power asymmetry often leads to communication costs and suffering during treatment.

Baichuan hopes that through AI, patients can more easily access the supply of high-quality medical resources. "Many people think medicine is too complex, and patients will never understand it. But we think about the jury system in the US judicial system. Law is also a very professional matter. The ordinary people on the jury don't understand, so it requires the judge, lawyers, and prosecutors to lead, engage in full debate, make things clear to a level where ordinary people can judge guilt or innocence, allowing ordinary people to make normal judgments based on logic," Wang Xiaochuan said.

This is also one of the reasons why Baichuan Intelligence is unwilling to only work on simple scenarios but hopes to continuously advance towards high-difficulty serious diagnosis and treatment.

When asked whether solving high-difficulty problems is the most commercially rewarding, Wang Xiaochuan gave a profound answer.

He believes that solving minor problems like colds and fevers can hardly build sufficient trust in users' minds. Healthcare is an industry highly dependent on trust. Only when AI can solve high-difficulty problems like serious illnesses can it truly establish a foundation of trust.

From a commercial logic perspective, patients facing serious health problems are also more willing to pay for high-quality AI services. This trust is not only a prerequisite for commercial returns but also the core of the scalable application of AI in healthcare.

On a more fundamental level, healthcare for Baichuan Intelligence and Wang Xiaochuan personally still represents a path approaching Artificial General Intelligence (AGI).

Wang Xiaochuan believes that AI has already found practical solutions in fields like literature, science, engineering, and art, but healthcare is an extremely unique field. Human exploration of medicine has not been exhausted, and AI is also in an exploratory stage in this field.

Baichuan's roadmap is very clear. First, use AI to improve diagnostic efficiency and solve the current shortage of medical supply. On this basis, Baichuan is committed to building deep trust with patients. When patients are willing to use AI tools for long-term medical consultations, AI can accumulate real and high-quality medical data during long-term companionship.

The ultimate goal of this data is to build a mathematical model of life. This is a path that human doctors have not yet fully traversed, and it is highly likely that AI will achieve it first in the future. If modeling the essence of life can be completed, it will become a key step in pushing general artificial intelligence towards higher-level progress.

Related Questions

Q: What is the significance of Baichuan-M3's performance on OpenAI's HealthBench?

A: Baichuan-M3 achieved a state-of-the-art (SOTA) score of 65.1 on OpenAI's HealthBench, surpassing OpenAI's own GPT-5.2 High model. This is significant because HealthBench is a highly authoritative and clinically realistic medical evaluation set, and it marks the first time a model has outperformed OpenAI's on this benchmark.

Q: What key technology did Baichuan use to improve the M3 model's performance in the medical field?

A: Baichuan introduced a technology called Fact Aware RL (Reinforcement Learning). This technique uses semantic clustering and importance weighting to reduce redundant expressions and ensure core medical assertions carry more weight. It also features a dynamic weight adjustment mechanism to balance learning medical knowledge with maintaining factual accuracy, preventing the model from being either overly speculative or too conservative.

Q: How did the Baichuan-M3 model perform compared to human doctors in the SCAN-bench evaluation?

A: In the SCAN-bench evaluation, which tests end-to-end consultation capabilities, the Baichuan-M3 model outperformed human specialist doctors across all four quadrants (Safety Stratification, Clarity Matters, Association & Inquiry, and Normative Protocol). The model's advantages included greater patience and superior cross-disciplinary knowledge.

Q: What is Baichuan's strategic focus for applying its AI in healthcare, according to the article?

A: Baichuan's strategic focus is on 'serious medicine.' They prioritize tackling more challenging and scientifically rigorous medical areas, such as oncology, over simpler applications like psychological therapy. Their goal is to enhance AI capabilities to provide more medical supply, build deep trust with users, and ultimately work towards modeling life itself as a path to AGI.

Q: How does Baichuan's approach to medical AI differ from companies that create 'doctor avatars'?

A: Baichuan believes that creating 'doctor avatars' merely replicates a doctor's existing level of expertise without surpassing it, potentially making the AI just a tool for customer acquisition. In contrast, Baichuan aims to push the boundaries of AI's technical capabilities to solve harder medical problems, thereby creating new, high-quality medical supply and genuinely advancing serious medicine.
