A 3B Model's Programming Scores Rival Opus 4.5: The Mysterious Model Sparking Heated Discussion Turns Out to Be Domestic

marsbit · Published 2026-06-18 · Updated 2026-06-18

Article Summary

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a 3B-parameter model has drawn attention on X: on some hard but verifiable reasoning tasks, such as programming, it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, despite being far smaller than any of them.

The model is named VibeThinker-3B. It is a dense reasoning model with 3 billion parameters that aims to explore how far verifiable reasoning capability can be pushed under a strict small-model size constraint.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is a domestic Chinese model, developed by the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

It performs exceptionally well on benchmarks of this kind: it scored 94.3 on AIME26, 89.3 on HMMT25, and 80.2 (Pass@1) on LiveCodeBench v6, and it achieved a 96.1% pass rate on the latest unpublished weekly and biweekly LeetCode contests held between April 25 and May 31, 2026.
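For readers unfamiliar with the Pass@1 figure quoted for LiveCodeBench v6, the sketch below shows the widely used unbiased pass@k estimator for code benchmarks. The article does not say how many samples per problem the report draws or whether it uses this exact estimator, so the function names and the sample counts in the usage lines are illustrative assumptions only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, as a percentage.

    results holds one (n, c) pair per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples drawn, samples correct).
demo = [(10, 10), (10, 3), (10, 0)]
score = benchmark_pass_at_k(demo, k=1)
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems, which is why Pass@1 is usually read as single-attempt accuracy.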

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and undergoes post-training using an upgraded Spectrum-to-Signal process. This process strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT), extends MGPO-style reinforcement learning to multiple verifiable domains, preserves complete long-context reasoning trajectories, and consolidates various capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Figure: Overall training pipeline of VibeThinker-3B (the Spectrum-to-Signal pipeline).

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.
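The article does not describe how CLR computes or uses its reliability estimates, so the sketch below is not CLR itself. It only illustrates the general idea behind this family of test-time scaling methods: sample several answers, attach a reliability score to each, and pick the answer with the highest total score rather than the most frequent one. The function name and the scores are hypothetical.

```python
from collections import defaultdict


def reliability_weighted_vote(samples: list[tuple[str, float]]) -> str:
    """Pick the final answer from (answer, reliability) pairs.

    Each reliability is an assumed score in [0, 1] produced by some
    external checker; how to obtain it is outside this sketch.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    totals: dict[str, float] = defaultdict(float)
    for answer, reliability in samples:
        # Sum reliability mass per distinct final answer.
        totals[answer] += reliability
    # Ties resolve to the answer that appeared first in the input.
    return max(totals, key=totals.get)


# Hypothetical samples: three low-reliability votes for "17" and one
# high-reliability vote for "42".
demo_samples = [("17", 0.3), ("17", 0.3), ("17", 0.3), ("42", 0.95)]
chosen = reliability_weighted_vote(demo_samples)
```

In this toy input a plain majority vote would return "17", while the weighted vote returns "42", which shows how a reliability signal can change the selected answer without retraining the model.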

The specific training pipeline is as follows:

  • Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
  • Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
  • Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
  • Instruct RL. The final stage improves the controllability for user-facing prompts. For format-sensitive and open-ended instructional data, rule-based verifiers and rubric-based reward models are employed.
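The offline self-distillation step above prioritizes trajectories that are correct but not yet well imitated by the student. The article gives no formula for Learning Potential Scoring, so the following is a minimal sketch under one assumed definition: a correct trajectory's score is the student's mean negative log-likelihood per token, and incorrect trajectories score zero. The `Trajectory` type, its fields, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str
    is_correct: bool  # outcome of the task's verifier
    student_token_logprobs: list[float]  # log p_student(token | prefix)


def learning_potential(traj: Trajectory) -> float:
    """Assumed score: high when a correct trajectory is unlikely under the student."""
    if not traj.is_correct or not traj.student_token_logprobs:
        return 0.0
    # Mean negative log-likelihood per token under the current student.
    return -sum(traj.student_token_logprobs) / len(traj.student_token_logprobs)


def select_for_distillation(trajs: list[Trajectory], top_k: int) -> list[Trajectory]:
    """Keep the top_k correct trajectories with the highest learning potential."""
    scored = [(learning_potential(t), t) for t in trajs]
    kept = [pair for pair in scored if pair[0] > 0.0]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in kept[:top_k]]


# Hypothetical pool: two correct trajectories and one incorrect one.
easy = Trajectory("easy", True, [-0.1, -0.1])
hard = Trajectory("hard", True, [-2.0, -1.0])
wrong = Trajectory("wrong", False, [-3.0, -3.0])
picked = select_for_distillation([easy, hard, wrong], top_k=2)
```

Under this assumption the incorrect trajectory is never selected, and the correct trajectory the student finds least likely is ranked first, since it has the most left to teach.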

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized the key points disclosed in the VibeThinker-3B technical report.

If this interests you, the technical report has the details. The model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge. This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought — an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian

Related Q&A

Q: What is the name and key characteristic of the small AI model discussed in the article?

A: The model is called VibeThinker-3B. Its key characteristic is that, despite having only 3 billion parameters, it achieves reasoning performance on verifiable tasks such as programming that is comparable to much larger frontier models.

Q: Which company or team developed the VibeThinker-3B model?

A: VibeThinker-3B was developed by the Sina Weibo team, making it a domestic Chinese model.

Q: What is the core hypothesis proposed by the creators of VibeThinker-3B regarding model capabilities?

A: The core hypothesis is the "Parameter Compression Coverage Hypothesis". It suggests that different capabilities depend on model parameters in distinct ways. Verifiable reasoning (multi-step reasoning, constraint satisfaction) is highly compressible and parameter-dense. In contrast, open-domain knowledge and understanding rely more on large-scale parameters for broad factual coverage.

Q: On which specific benchmark tasks did VibeThinker-3B demonstrate exceptional performance?

A: VibeThinker-3B demonstrated exceptional performance on verifiable reasoning benchmarks such as AIME26 (97.1 with CLR), HMMT25 (95.4 with CLR), and LiveCodeBench v6 (80.2 Pass@1), as well as on the latest unpublished weekly and biweekly LeetCode contests (96.1% pass rate).

Q: What are the main limitations or scope of application for the VibeThinker-3B model as stated in the article?

A: The model's applicability is limited. It excels in domains with clear verification signals (math, programming, STEM) but does not perform well in areas requiring general world knowledge, open-domain dialogue, or understanding of long-tail scenarios, as these rely on broader parametric coverage.
