Just 6 Days After Launching ChatGPT Health, OpenAI Is Surpassed on Its Own Medical Benchmark

marsbit | Published on 2026-01-14 | Last updated on 2026-01-14

Abstract

In a significant development in the AI healthcare sector, Baichuan Intelligence has surpassed OpenAI's GPT-5.2 High on the HealthBench benchmark—a medical evaluation dataset created by OpenAI with input from 260+ doctors across 60 countries—just six days after OpenAI launched ChatGPT Health. Baichuan's new model, Baichuan-M3, achieved a top score of 65.1 and also led in the more challenging HealthBench Hard subset, while demonstrating the lowest hallucination rate (3.5%) without relying on external tools. Key to M3’s performance is its Fact Aware RL technique, which improves diagnostic accuracy by balancing factual precision with proactive questioning. The model avoids both over-confident errors and overly vague responses. Additionally, Baichuan introduced SCAN-bench, a new evaluation framework designed to simulate real doctor-patient interactions. In tests, M3 outperformed human specialists in areas like safety stratification, clarity, and diagnostic questioning, partly due to its ability to integrate knowledge across medical disciplines. Baichuan is now rolling out the model via its consumer product Baixiaoying (百小应), offering tailored interfaces for both doctors and patients. The company emphasizes a focus on "serious medicine," prioritizing complex areas like oncology over general wellness, aiming to augment—not just assist—medical professionals. According to CEO Wang Xiaochuan, enhancing AI’s capability in high-stakes medical scenarios is crucial for building user trus...

Author: Li Yuan

Have you ever asked an AI assistant about your health problems?

If you are a heavy user of AI like me, you probably have.

According to OpenAI's own data, health has become one of the most common use cases for ChatGPT, with over 230 million people worldwide asking health and wellness-related questions every week.

Because of this, as we move into 2026, the health field is showing strong signs of becoming a battleground in the AI sector.

On January 7th, OpenAI launched ChatGPT Health, allowing users to connect electronic medical records and various health apps to get more targeted medical responses; and on January 12th, Anthropic immediately launched Claude for Healthcare, emphasizing the new model's capabilities in medical scenarios.

Interestingly, this time, a Chinese company is not lagging behind, and even seems to be taking the lead.

On January 13th, Baichuan Intelligence announced the release of the Baichuan M3 model, which surpassed OpenAI's GPT-5.2 High on HealthBench, a medical and health evaluation test set released by OpenAI, achieving SOTA.

After facing much skepticism for announcing an All-in strategy on healthcare, Baichuan Intelligence seems to have finally proven itself. Geek Park specifically spoke with Wang Xiaochuan to discuss how Baichuan Intelligence views the capabilities of this M3 model and the endgame of AI in healthcare.

01 First to Surpass OpenAI on a Health Domain Test Set

One of the most eye-catching achievements of the newly released M3 model is that it surpassed OpenAI's GPT-5.2 High for the first time on HealthBench, a medical and health evaluation test set released by OpenAI, achieving SOTA.

SOTA on HealthBench, HealthBench Hard, and Hallucination Evaluation

HealthBench is a medical and health evaluation test set released by OpenAI in May 2025. It was built by 262 doctors from 60 countries and contains 5,000 highly realistic multi-turn medical dialogues. It is one of the most authoritative and clinically realistic medical evaluation sets globally.

Since its release, OpenAI's models have dominated the rankings.

This time, however, Baichuan Intelligence's new generation open-source medical large model, Baichuan-M3, achieved a comprehensive score of 65.1, ranking first globally. It even topped the charts on HealthBench Hard, which specifically tests complex decision-making abilities, setting a new high score.

Baichuan also simultaneously released a hallucination rate test result. The M3 model achieved a hallucination rate of 3.5%, among the lowest globally.

It is worth noting that this hallucination rate is measured in a pure model setting without relying on external retrieval tools.

Baichuan Intelligence stated that the key model improvement enabling these two points is the introduction of a reinforcement learning algorithm suitable for healthcare.

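With the M3 model, Baichuan used its Fact Aware RL (fact-aware reinforcement learning) technique for the first time, so that the model neither falls back on boilerplate answers nor makes things up.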

This is actually very critical in the medical field.

When an unoptimized model is asked medical questions, two problems are most common. One is that the model fabricates your symptoms and speculates about a disease. The other is vagueness: the answer is noncommittal and ends by suggesting you still need to see a doctor, which helps neither doctors nor patients.

This is precisely because many models use pure hallucination rate as the optimization target. At this point, the model might dilute the overall hallucination rate by piling up simple, correct facts. Baichuan introduced semantic clustering and importance weighting mechanisms—clustering eliminates interference from redundant expressions, and weighting ensures core medical assertions receive higher weight.
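The dilution effect and the fix described above can be illustrated with a small sketch. This is not Baichuan's implementation: the `Claim` structure, the cluster ids, and the importance values are all hypothetical, and the sketch assumes the claims have already been extracted, clustered, and fact-checked by some upstream step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    cluster: str       # hypothetical semantic-cluster id; paraphrases share one id
    importance: float  # hypothetical weight; core medical assertions get more
    supported: bool    # whether an upstream fact check supported the claim


def naive_rate(claims):
    """Share of unsupported claims, with every claim counted equally."""
    return sum(not c.supported for c in claims) / len(claims)


def weighted_rate(claims):
    """Importance-weighted unsupported share, counting each cluster once."""
    clusters = {}
    for c in claims:
        # Keep the heaviest weight per cluster; a cluster counts as
        # unsupported if any of its member claims is unsupported.
        weight, bad = clusters.get(c.cluster, (0.0, False))
        clusters[c.cluster] = (max(weight, c.importance), bad or not c.supported)
    total = sum(w for w, _ in clusters.values())
    return sum(w for w, bad in clusters.values() if bad) / total


# One wrong core assertion, padded with four restatements of a trivial fact.
answer = [Claim("diagnosis", 5.0, False)] + [Claim("hydration", 1.0, True)] * 4
print(naive_rate(answer))     # the padding pulls the naive rate down
print(weighted_rate(answer))  # clustering and weighting undo the padding
```

In this toy answer the naive rate is 1 in 5, while the clustered, weighted rate is 5/6, because the four paraphrases collapse into a single low-weight cluster.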

At the same time, simply introducing a heavily weighted hallucination penalty can easily force the model into a conservative "say less, make fewer mistakes" strategy. The Fact Aware RL algorithm therefore also includes a dynamic weight adjustment mechanism that adaptively balances the two goals according to the model's current capability level: it focuses on learning and expressing medical knowledge (high Task Weight) during the capability-building phase, then gradually tightens factual constraints (increasing Hallucination Weight) once capabilities mature.
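One way to picture this kind of schedule is a reward whose two weights move with a capability estimate. The sketch below is an assumption-laden illustration, not the published algorithm: the linear ramp, the thresholds `low` and `high`, the weight ranges, and the function names are all hypothetical.

```python
def dynamic_weights(capability, low=0.4, high=0.8):
    """Map a capability estimate in [0, 1] to (task_weight, hallucination_weight).

    The thresholds and weight ranges are illustrative assumptions.
    """
    # Fraction of the way from the capability-building phase to maturity.
    progress = min(1.0, max(0.0, (capability - low) / (high - low)))
    task_weight = 1.0 - 0.5 * progress           # dominant early, relaxed later
    hallucination_weight = 0.2 + 0.8 * progress  # tightens as capability matures
    return task_weight, hallucination_weight


def reward(task_score, hallucination_rate, capability):
    """Combine task quality and a hallucination penalty under the current weights."""
    task_w, hall_w = dynamic_weights(capability)
    return task_w * task_score - hall_w * hallucination_rate


# The same answer is scored differently early and late in training.
early = reward(task_score=0.6, hallucination_rate=0.1, capability=0.2)
late = reward(task_score=0.6, hallucination_rate=0.1, capability=0.9)
print(early, late)
```

With these made-up numbers, the same answer earns a higher reward early in training than late, which reflects the idea that factual constraints bite harder once the model is capable enough to satisfy them without going silent.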

When internet search is available, Baichuan also added an online verification module based on multi-turn search and introduced an efficient caching system for aligning massive medical knowledge.

02 Consultation Level Surpasses Human Doctors, Entering the Usable Stage

However, surpassing OpenAI on HealthBench is not the only highlight this time.

A more interesting point is that Baichuan creatively built its own SCAN-bench evaluation set. Compared to just topping OpenAI's leaderboard, the evaluation set built by Baichuan itself might better illustrate the direction Baichuan Intelligence wants to optimize for in healthcare.

The key point of this evaluation set built by Baichuan is to optimize "end-to-end consultation capability." This stems from an insight from Baichuan's own experiments: for every 2% increase in consultation accuracy, diagnosis accuracy increases by 1%.

That is, compared to OpenAI's HealthBench, which still mainly focuses on "whether the AI can answer questions," Baichuan's SCAN-bench hopes to evaluate: can the AI, in a Q&A process, obtain effective information and simultaneously provide correct diagnosis results and medical advice.

Usually, simply telling an AI assistant "you are an experienced doctor" does not yield very good model performance, because a real doctor's consultation process is highly standardized. Baichuan summarizes it as the four quadrants of the SCAN principle: Safety Stratification, Clarity Matters, Association & Inquiry, and Normative Protocol.

Centered on the SCAN principle, Baichuan drew on the OSCE method long used in medical education and collaborated with over 150 frontline doctors to build the SCAN-bench evaluation system. It breaks the diagnosis process into three stages (medical history collection, auxiliary examination, and precise diagnosis) and assesses them through dynamic, multi-turn methods that simulate the doctor's full process from consultation to diagnosis. Baichuan also optimizes the model toward better results in these stages.

This time, Baichuan also announced the M3 model's evaluation results on SCAN-bench.

The results are very interesting. Baichuan not only compared with other models this time but also brought in real doctors for comparison. And in the four quadrants, the real doctors have actually fallen behind the level that the model can achieve.

Geek Park specifically asked the Baichuan team about this and received the answer: this evaluation involved real specialist doctors comparing with the model on specialist cases. The model won, firstly, because the model is more patient, but more importantly, the model has better mastery of interdisciplinary knowledge.

Take one case involving a 10-year-old child with recurrent fever. Fever is a symptom that cuts across many areas of medicine. If the doctor only asks about cough and other lung conditions, it is easy to overlook serious problems in the joints and urinary system and misdiagnose the case as a common infection.

Human doctors are usually only good at conditions within their specialty, which is why complex symptoms often require specialist consultations, and why even experts in difficult and complicated diseases often need to consult books and look up information.

Ordinary models that have not been specifically trained and are merely role-playing as doctors often struggle to answer such questions well.

03 Next Step: Gradually Start Making C-end Products, Promoting More Serious Healthcare

For Baichuan Intelligence, surpassing human doctors is a very significant milestone: it means AI is starting to cross the usability threshold and can begin to be deployed in usage scenarios.

Starting January 13th, users can already experience the answers provided by the M3 model on the Baixiaoying website and app.

The current website design is very interesting. It is split into a doctor version and a user version, both of which use the M3 model for answers. In the doctor version, the answers are more concise, cite more references, and use more technical language rather than layman's terms. In the ordinary patient version, the model almost never gives an answer all at once; it asks more follow-up questions before providing a clearer diagnosis.

Baichuan Intelligence mentioned that the model's behind-the-scenes reasoning is very interesting. "We often see the model mention in its chain of thought, 'This patient didn't answer my question, but I must ask this question.' We've even seen extreme cases where it says, 'I've already asked the patient 20 rounds, which has exceeded the set maximum number of rounds, but I still have to ask this question.' This is because during training, the model doesn't get rewarded for being slick with its words; it only gets rewarded if it truly obtains enough key information and makes the correct diagnosis. This is a clear difference between how we train our models and how others do."

Many AI companies have recently started moving into the medical field. This is also where Baichuan Intelligence sees its biggest difference: it wants to do more serious healthcare.

"This means that when Baichuan chooses a scenario, it's not about which scenario is easiest to do. On the contrary, Baichuan insists on continuously pushing technological capabilities and challenging more difficult problems," Wang Xiaochuan said.

A typical example is that Baichuan will prioritize solving scenarios in oncology, while psychological healing is lower on Baichuan's priority list.

Popular opinion generally holds that psychological healing is a simpler scenario for AI and easier to implement. Baichuan's judgment is different. The company believes oncology has a stricter scientific basis, so AI is more likely to produce serious medical effects there and thereby reach or even surpass the level of human doctors. In contrast, the field of psychology lacks this kind of deterministic scientific anchor point.

Another example is that some companies choose to create avatars for doctors. Wang Xiaochuan believes this direction is not what Baichuan wants to do. A doctor's avatar cannot fully reproduce the doctor's level of ability, let alone surpass it. Such AI can only end up being a front and a customer acquisition tool, not truly promoting serious healthcare.

This insistence on seriousness deeply affects many of Baichuan's business choices.

This is directly related to Wang Xiaochuan's thinking on the fundamental issues of the next stage of medical AI. He believes that the most important task at the current stage is to gradually provide more medical supply based on enhancing AI capabilities.

China has been trying to implement a hierarchical diagnosis and treatment system and a general practitioner system for many years. The original intention was for ordinary people to first seek medical care at the grassroots level, solving the problems of difficult appointments, long queues, and severe congestion in large hospitals.

The reason this system has been difficult to roll out is essentially an insufficient supply of medical resources. Grassroots medical institutions lack high-level doctors. People are willing to queue at tertiary hospitals even for a cold because they distrust the diagnostic level at the grassroots level.

This is the key point where medical AI can play a role. Large models can distribute top-tier medical knowledge at scale. They fill the supply gap at the grassroots level, allowing every community and every family to have diagnostic capabilities comparable to those of experts from tertiary hospitals.

In the long run, this can have a broader impact, potentially shifting the decision-making power in healthcare from doctors to users. In traditional medical scenarios, patients are the beneficiaries but often lack decision-making power. Decision-making power is concentrated in the hands of doctors. This power asymmetry often leads to communication costs and suffering during treatment.

Baichuan hopes that through AI, patients can more easily access the supply of high-quality medical resources. "Many people think medicine is too complex, and patients will never understand it. But we think about the jury system in the US judicial system. Law is also a very professional matter. The ordinary people on the jury don't understand, so it requires the judge, lawyers, and prosecutors to lead, engage in full debate, make things clear to a level where ordinary people can judge guilt or innocence, allowing ordinary people to make normal judgments based on logic," Wang Xiaochuan said.

This is also one of the reasons why Baichuan Intelligence is unwilling to only work on simple scenarios but hopes to continuously advance towards high-difficulty serious diagnosis and treatment.

When asked whether solving high-difficulty problems is the most commercially rewarding, Wang Xiaochuan gave a profound answer.

He believes that solving minor problems like colds and fevers can hardly build sufficient trust in users' minds. Healthcare is an industry highly dependent on trust. Only when AI can solve high-difficulty problems like serious illnesses can it truly establish a foundation of trust.

From a commercial perspective, patients facing serious health problems are also more willing to pay for high-quality AI services. This trust is not only a prerequisite for commercial returns but also the core of scalable AI applications in healthcare.

On a more fundamental level, healthcare for Baichuan Intelligence and Wang Xiaochuan personally still represents a path approaching Artificial General Intelligence (AGI).

Wang Xiaochuan believes that AI has already found practical solutions in fields like literature, science, engineering, and art, but healthcare is an extremely unique field. Human exploration of medicine has not been exhausted, and AI is also in an exploratory stage in this field.

Baichuan's roadmap is very clear. First, use AI to improve diagnostic efficiency and solve the current shortage of medical supply. On this basis, Baichuan is committed to building deep trust with patients. When patients are willing to use AI tools for long-term medical consultations, AI can accumulate real and high-quality medical data during long-term companionship.

The ultimate goal of this data is to build a mathematical model of life. This is a path that human doctors have not yet fully traversed, and it is highly likely that AI will achieve it first in the future. If modeling the essence of life can be completed, it will become a key step in pushing general artificial intelligence towards higher-level progress.
