3B Small Model's Programming Scores Rival Opus 4.5, Mysterious Model Sparks Heated Discussion, Turns Out to be Domestic

marsbit | Published 2026-06-18 | Updated 2026-06-18

Article Summary

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a small 3B model has been gaining attention on X. On some difficult, verifiable reasoning tasks (such as programming), it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, despite being far smaller than any of them.

The model is named VibeThinker-3B. It is a dense reasoning model with 3 billion parameters, built to explore how far verifiable reasoning capability can be pushed under a strict small-model size constraint.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is also a domestic (Chinese) model, developed by the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

As a result, it performs exceptionally well on a range of benchmarks. It scored 94.3 on AIME26, 89.3 on HMMT25, and 80.2 (Pass@1) on LiveCodeBench v6, and achieved a 96.1% pass rate on the latest unpublished weekly and biweekly LeetCode contests held between April 25 and May 31, 2026.
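For readers unfamiliar with the Pass@1 figure quoted above, here is a minimal sketch of the widely used unbiased pass@k estimator. The article does not say exactly how the report computed its score, so treat this as the common convention rather than the authors' evaluation code. The function names and the toy results are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results, NOT taken from the report.
toy_results = [(8, 8), (8, 4), (8, 0), (8, 6)]
print(benchmark_pass_at_k(toy_results, k=1))  # 0.5625
```

With k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems, which is why Pass@1 is often read as single-attempt accuracy.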

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and post-trained with an upgraded Spectrum-to-Signal pipeline. The pipeline strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT), extends MGPO-style reinforcement learning to multiple verifiable domains, preserves complete long-context reasoning trajectories, and consolidates the resulting capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Figure: Overall training pipeline of VibeThinker-3B (the Spectrum-to-Signal pipeline).

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.
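The article does not describe how CLR scores individual claims, so the sketch below is not CLR. It shows a simpler, well-known form of test-time scaling for answer-verifiable tasks: sample several answers and take a (optionally weighted) vote. The per-sample weights are a hypothetical stand-in for any reliability signal, and the function name and sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Optional, Sequence


def weighted_vote(answers: Sequence[str], weights: Optional[Sequence[float]] = None) -> str:
    """Return the answer with the largest total weight (plain majority if weights is None)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    if weights is None:
        weights = [1.0] * len(answers)
    if len(weights) != len(answers):
        raise ValueError("answers and weights must have the same length")
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer.strip()] += weight
    # Ties go to the answer that was seen first (dict insertion order).
    return max(totals, key=totals.get)


# Hypothetical final answers from five sampled reasoning traces.
sampled = ["204", "204", "113", "204", "113"]
print(weighted_vote(sampled))                             # 204 (3 votes vs 2)
print(weighted_vote(sampled, [0.2, 0.1, 0.9, 0.2, 0.8]))  # 113 (1.7 vs 0.5)
```

The second call illustrates why a reliability signal can matter: a minority answer wins when the traces supporting it are judged more trustworthy than the majority's.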

The specific training pipeline is as follows:

Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
Instruct RL. The final stage improves controllability on user-facing prompts. For format-sensitive and open-ended instruction data, rule-based verifiers and rubric-based reward models are employed.
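The offline self-distillation stage above mentions Learning Potential Scoring but the article gives no formula for it. The following is a minimal sketch under stated assumptions: "not yet well imitated" is measured here by the student's mean per-token negative log-likelihood, and incorrect trajectories score zero. The names `Trajectory`, `learning_potential`, and `select_for_distillation` are hypothetical and do not come from the report.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt_id: str
    is_correct: bool                     # verdict from a verifier (tests, answer checker)
    student_token_logprobs: list[float]  # student's log-prob of each trajectory token


def learning_potential(traj: Trajectory) -> float:
    """Assumed score: mean per-token NLL under the student, zero if the trajectory is wrong."""
    if not traj.is_correct or not traj.student_token_logprobs:
        return 0.0
    return -sum(traj.student_token_logprobs) / len(traj.student_token_logprobs)


def select_for_distillation(trajs: list[Trajectory], top_k: int) -> list[Trajectory]:
    """Keep the top_k correct trajectories with the highest assumed score."""
    correct = [t for t in trajs if t.is_correct]
    return sorted(correct, key=learning_potential, reverse=True)[:top_k]


# Hypothetical pool of trajectories with made-up log-probabilities.
pool = [
    Trajectory("p1", True, [-0.05, -0.10, -0.02]),   # correct, already easy for the student
    Trajectory("p2", True, [-1.20, -0.90, -1.50]),   # correct, poorly imitated: high priority
    Trajectory("p3", False, [-2.00, -2.50, -1.80]),  # incorrect: filtered out
]
print([t.prompt_id for t in select_for_distillation(pool, top_k=2)])  # ['p2', 'p1']
```

The design choice in this sketch is to filter on correctness first and rank second, so a trajectory the student finds surprising is only ever selected when a verifier has confirmed it.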

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized the key points disclosed in the VibeThinker-3B technical report.

If you are interested in this content, you can delve into their technical report. Currently, the model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge. This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought — an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian
