3B Small Model's Programming Scores Rival Opus 4.5, Mysterious Model Sparks Heated Discussion, Turns Out to be Domestic

marsbit, published 2026-06-18, updated 2026-06-18

Article Summary

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a 3B-parameter model has been drawing attention on X: on some difficult, verifiable reasoning tasks (such as programming), it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, while being far smaller than any of them.

The model is named VibeThinker-3B. It is a dense reasoning model with 3 billion parameters, built to explore how far verifiable reasoning capability can be pushed under a strict small-scale constraint.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is also a domestic (Chinese) model, developed by the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

Therefore, it performs exceptionally well in various benchmark tests. It scored 94.3 on the AIME26 test, 89.3 on the HMMT25 test, 80.2 on the LiveCodeBench v6 test (Pass@1), and achieved a 96.1% pass rate in the latest unpublished weekly and biweekly LeetCode contests between April 25 and May 31, 2026.
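For readers unfamiliar with the Pass@1 figure quoted for LiveCodeBench v6, the sketch below shows the widely used unbiased pass@k estimator, which reduces to the fraction of correct samples when k = 1. It is only an illustration of the metric: the article does not say how many samples per problem the report drew, and the function names and the example counts here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts, not data from the report.
example = [(8, 8), (8, 4), (8, 0)]
print(round(100 * benchmark_pass_at_k(example, k=1), 1))  # 50.0
```

With k = 1 each problem contributes c / n, so the benchmark score is simply the mean per-problem success rate, expressed as a percentage.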

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and post-trained with an upgraded Spectrum-to-Signal pipeline. The pipeline strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT); extends MGPO-style reinforcement learning to multiple verifiable domains; preserves complete long-context reasoning trajectories; and consolidates the resulting capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

[Figure: Overall training pipeline of VibeThinker-3B (Spectrum-to-Signal pipeline).]

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.
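The article does not describe how CLR works internally, so the following is not CLR. It is only a generic sketch of one common test-time scaling pattern for answer-verifiable tasks: sample several solutions, attach a reliability score to each, and return the final answer with the highest total reliability. The function name, the scores, and the sample answers are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(candidates: list[tuple[str, float]]) -> str:
    """Pick the final answer whose samples carry the most total reliability.

    candidates: (final_answer, reliability in [0, 1]) for each sampled solution.
    The reliability scores are assumed to come from some external checker.
    """
    if not candidates:
        raise ValueError("no candidates")
    totals: dict[str, float] = defaultdict(float)
    for answer, reliability in candidates:
        totals[answer] += reliability
    # Ties go to the answer that appeared first (dict insertion order).
    return max(totals, key=totals.get)


# Hypothetical samples: plain majority voting would tie 2-2 here.
samples = [("42", 0.9), ("41", 0.3), ("41", 0.4), ("42", 0.2)]
print(weighted_vote(samples))  # 42
```

The point of the example is that extra inference-time compute (more samples plus a scoring step) can change the selected answer without changing the model's weights.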

The specific training pipeline is as follows:

  • Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
  • Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
  • Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
  • Instruct RL. The final stage improves the controllability for user-facing prompts. For format-sensitive and open-ended instructional data, rule-based verifiers and rubric-based reward models are employed.
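The offline self-distillation step above prioritizes trajectories that are "correct but not yet well imitated by the student". The article gives no formula for Learning Potential Scoring, so the sketch below is one assumed reading: discard incorrect trajectories, then rank the rest by the student's mean per-token negative log-likelihood, so that poorly imitated trajectories come first. The `Trajectory` class, both function names, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt_id: str
    is_correct: bool                     # verdict from a verifier (assumed given)
    student_token_logprobs: list[float]  # student's log-prob of each trajectory token


def learning_potential(t: Trajectory) -> float:
    """Assumed score: 0 for wrong trajectories, else the student's mean per-token NLL."""
    if not t.is_correct or not t.student_token_logprobs:
        return 0.0
    return -sum(t.student_token_logprobs) / len(t.student_token_logprobs)


def select_for_distillation(pool: list[Trajectory], budget: int) -> list[Trajectory]:
    """Keep only correct trajectories, highest learning potential first."""
    correct = [t for t in pool if t.is_correct]
    return sorted(correct, key=learning_potential, reverse=True)[:budget]


pool = [
    Trajectory("p1", True, [-0.1, -0.2]),   # correct, already well imitated
    Trajectory("p2", True, [-1.5, -2.5]),   # correct, poorly imitated
    Trajectory("p3", False, [-3.0, -3.0]),  # wrong, filtered out
]
print([t.prompt_id for t in select_for_distillation(pool, budget=2)])  # ['p2', 'p1']
```

Under this reading, a trajectory the student already reproduces with high probability contributes little new signal, so it is ranked below a correct trajectory the student still finds surprising.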

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized the key points disclosed in the VibeThinker-3B technical report.

If you are interested in this content, you can delve into their technical report. Currently, the model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge.

This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought, an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian

Related Q&A

Q: What is the name and key characteristic of the small AI model discussed in the article?

A: The model is called VibeThinker-3B. Its key characteristic is that despite being a small model with only 3 billion parameters, it achieves reasoning performance on verifiable tasks like programming that is comparable to much larger frontier models.

Q: Which company or team developed the VibeThinker-3B model?

A: The VibeThinker-3B model was developed by the Sina Weibo (microblog) team, making it a domestic Chinese model.

Q: What is the core hypothesis proposed by the creators of VibeThinker-3B regarding model capabilities?

A: The core hypothesis is the "Parameter Compression Coverage Hypothesis". It suggests that different capabilities depend on model parameters in distinct ways. Verifiable reasoning (multi-step reasoning, constraint satisfaction) is highly compressible and parameter-dense. In contrast, open-domain knowledge and understanding rely more on large-scale parameters for broad factual coverage.

Q: On which specific benchmark tasks did VibeThinker-3B demonstrate exceptional performance?

A: VibeThinker-3B demonstrated exceptional performance on verifiable reasoning benchmarks such as AIME26 (97.1 with CLR), HMMT25 (95.4 with CLR), LiveCodeBench v6 (80.2 Pass@1), and recent private LeetCode contests (96.1% pass rate).

Q: What are the main limitations or scope of application for the VibeThinker-3B model as stated in the article?

A: The model's applicability is limited. It excels in domains with clear verification signals (math, programming, STEM) but does not perform well in areas requiring general world knowledge, open-domain dialogue, or understanding of long-tail scenarios, as these rely on broader parametric coverage.
