3B Small Model's Programming Scores Rival Opus 4.5, Mysterious Model Sparks Heated Discussion, Turns Out to be Domestic

marsbitPublicado em 2026-06-18Última atualização em 2026-06-18

Resumo

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a 3B small model has gained popularity on X because in some difficulty-verifiable reasoning tasks (like programming), it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, while its size is far smaller than these models.

This model is named VibeThinker-3B, a dense reasoning model with 3 billion parameters, aiming to explore how far verifiable reasoning capability can be pushed under strictly small model scale constraints.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is also a domestic model, coming from the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

Therefore, it performs exceptionally well in various benchmark tests. It scored 94.3 on the AIME26 test, 89.3 on the HMMT25 test, 80.2 on the LiveCodeBench v6 test (Pass@1), and achieved a 96.1% pass rate in the latest unpublished weekly and biweekly LeetCode contests between April 25 and May 31, 2026.

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and undergoes post-training using an upgraded Spectrum-to-Signal process. This process strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT), extends MGPO-style reinforcement learning to multiple verifiable domains, preserves complete long-context reasoning trajectories, and consolidates various capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Overall training pipeline of VibeThinker-3B

Spectrum-to-Signal pipeline.

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.

The specific training pipeline is as follows:

Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
Instruct RL. The final stage improves the controllability for user-facing prompts. For format-sensitive and open-ended instructional data, rule-based verifiers and rubric-based reward models are employed.

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized key points disclosed in the VibeThinker-3B technical report, including the following:

If you are interested in this content, you can delve into their technical report. Currently, the model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge. This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought — an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian

Perguntas relacionadas

QWhat is the name and key characteristic of the small AI model discussed in the article?

AThe model is called VibeThinker-3B. Its key characteristic is that despite being a small model with only 3 billion parameters, it achieves reasoning performance on verifiable tasks like programming that is comparable to much larger frontier models.

QWhich company or team developed the VibeThinker-3B model?

AThe VibeThinker-3B model was developed by the Sina Weibo (microblog) team, making it a domestic Chinese model.

QWhat is the core hypothesis proposed by the creators of VibeThinker-3B regarding model capabilities?

AThe core hypothesis is the 'Parameter-Compression Coverage Hypothesis'. It suggests that different capabilities depend on model parameters in distinct ways. Verifiable reasoning (multi-step reasoning, constraint satisfaction) is highly compressible and parameter-dense. In contrast, open-domain knowledge and understanding rely more on large-scale parameters for broad factual coverage.

QOn which specific benchmark tasks did VibeThinker-3B demonstrate exceptional performance?

AVibeThinker-3B demonstrated exceptional performance on verifiable reasoning benchmarks such as AIME26 (97.1 with CLR), HMMT25 (95.4 with CLR), LiveCodeBench v6 (80.2 Pass@1), and recent private LeetCode contests (96.1% pass rate).

QWhat are the main limitations or scope of application for the VibeThinker-3B model as stated in the article?

AThe model's applicability is limited. It excels in domains with clear verification signals (math, programming, STEM) but does not perform well in areas requiring general world knowledge, open-domain dialogue, or understanding of long-tail scenarios, as these rely on broader parametric coverage.

Leituras Relacionadas

With SpaceX, OpenAI, and Anthropic Listing in Succession, Can the Market Really Absorb Them?

The article analyzes the market's capacity to absorb the concurrent mega-IPOs of SpaceX, OpenAI, and Anthropic, which could collectively seek over $200 billion from public markets—more than four times the total U.S. IPO volume in 2025. SpaceX, already public at a $2.1 trillion valuation, demonstrated strong demand with significant oversubscription. Anthropic, targeting a Q4 2026 listing, presents a compelling financial story with rapid revenue growth and a projected first quarterly operating profit. OpenAI, however, faces the greatest scrutiny due to its high cash burn and lack of profitability, with its eventual financial disclosures posing a key risk to its ~$1 trillion target valuation. Wall Street sentiment is divided: bulls point to ample liquidity in money markets and pent-up demand for pure-play AI stocks, while bears warn of a liquidity drain and risk transfer from private to public investors. A "capitulation long" mindset—participating despite bubble concerns—is also noted. The author concludes the market has the appetite for individual offerings, as shown by SpaceX. The real test will be whether OpenAI and Anthropic's financial fundamentals can support their valuations upon public scrutiny. The ultimate question for AI valuations remains whether enterprise adoption translates to tangible cost savings or revenue generation.

marsbitHá 20m

With SpaceX, OpenAI, and Anthropic Listing in Succession, Can the Market Really Absorb Them?

marsbitHá 20m

SK Hynix Stock Price Hits New High: Delivers HBM4E Samples, Reinforcing Its Leading Position in AI Memory

SK Hynix Delivers HBM4E Samples, Shares Hit Record High on Strong AI Memory Outlook SK Hynix has delivered samples of its next-generation AI memory chip, HBM4E, to major customers, sending its stock price soaring 7.3% to a historic high. The new 12-layer stacked flagship product offers a data processing speed of 16Gbps per pin, a more than 20% improvement in power efficiency, and a 17% reduction in thermal resistance compared to the previous generation. It also achieves a single-chip capacity of 48GB, enabled by Advanced MR-MUF packaging technology. This sample delivery accelerates SK Hynix's technological iteration in the high-bandwidth memory (HBM) field, solidifying its core position in the AI infrastructure supply chain. The market response reflects strong confidence in the company's ability to maintain leadership in the AI memory race, building on its established track record of mass production and supply for HBM3, HBM3E, and HBM4. The performance and efficiency gains of HBM4E are expected to enhance data processing capabilities in AI training and inference scenarios, addressing performance bottlenecks in next-generation AI systems.

marsbitHá 43m

SK Hynix Stock Price Hits New High: Delivers HBM4E Samples, Reinforcing Its Leading Position in AI Memory

marsbitHá 43m

Ethereum 2026 Q1 Review: On-Chain Activity Hits Record Highs, Tokenized Assets Lead the Industry

Ethereum Q1 2026 Review: Record On-Chain Activity, Tokenized Assets Lead the Industry. Despite a price correction impacting USD-denominated metrics, Ethereum's on-chain usage hit all-time highs in Q1 2026. Monthly active addresses surged 85.9% year-over-year to 13.2 million, while L1 transactions and throughput also set new records. This growth occurred alongside a significant 47.9% quarterly drop in L1 transaction fees, demonstrating the impact of network scaling via upgrades like the Blob Parameter Fork. The ecosystem maintained its dominance in decentralized finance (DeFi), holding 71% of the total value locked among top chains and 79.2% of active borrowing. Ethereum solidified its position as the primary platform for tokenized real-world assets (RWAs), with a total market cap of $203.4B. It holds leading shares in stablecoins (61.8%), tokenized funds (73%), and tokenized commodities (84%) across major chains. Key developments included the ERC-8004 standard for AI agents and heightened institutional engagement at forums. Major financial institutions like BlackRock, JPMorgan, and a European banking consortium announced new tokenized products on Ethereum throughout the period. The report draws parallels to the early internet, suggesting Ethereum is sacrificing short-term fee revenue for long-term network expansion and adoption. Its strategy focuses on becoming a neutral, open settlement layer for global finance, with scaling roadmaps aiming for tens of thousands of TPS by 2029.

marsbitHá 44m

Ethereum 2026 Q1 Review: On-Chain Activity Hits Record Highs, Tokenized Assets Lead the Industry

marsbitHá 44m

Either Go Full-Stack or Get Out: The Calculations Behind xAI's $60 Billion Acquisition of Cursor

The article discusses xAI's $60 billion stock acquisition of Anysphere, the parent company of the AI coding tool Cursor, arguing that the core motivation is not market share but access to high-quality training data from its 7 million daily developers. It posits that to become a major AI player, a company must build a full-stack encompassing compute, model, and application layers. This thesis is illustrated by Anthropic's 540x revenue growth in 28 months, largely driven by its coding product, Claude Code, which captured 54% of the enterprise AI programming market. The author, a VC, contends that full-stack integration creates sustainable unit economics for model training and provides proprietary data for defensible competitive advantages, predicting a wave of model companies aggressively building or acquiring application-layer products. The central takeaway is that in an era where building products is 10x easier, ambition must be 10x greater to succeed.

marsbitHá 54m

Either Go Full-Stack or Get Out: The Calculations Behind xAI's $60 Billion Acquisition of Cursor

marsbitHá 54m

Matrixdock Featured Again in SBMA’s 《Crucible》: Discussing How Tokenisation Enhances Efficiency in the Precious Metals Market

Matrixdock's research article, titled "Why Tokenisation Matters for the Bullion Industry and How Carrying Costs Fit In," has been featured again in the SBMA's industry publication *Crucible*. Authored by Matrixdock lead Eva Meng, the piece examines how tokenisation enhances the efficiency and utility of the precious metals market. The article argues that tokenisation builds upon the accessibility improvements brought by gold ETFs, not by redefining gold's value but by enabling it to function within digital finance. It extends gold's role beyond a portfolio holding, potentially facilitating instant settlement, digital collateral, and operation in 24/7 markets. A key focus is transparently handling the unavoidable carrying costs (storage, insurance) of physical assets like gold and silver. Matrixdock introduces the Fungible Reserve Standard (FRS) framework, based on an "Economic Purity Principle," which aims to reflect these real-world economic costs clearly within the token mechanism, rather than bundling them opaquely. The platform's practical applications are highlighted, including its gold token XAUm and its silver token XAGm, the first built on the FRS framework. As the tokenised gold market surpassed $6 billion in February 2026, the industry's focus is shifting from initial proofs of reserves to broader concerns of market efficiency and capital utilization. Tokenisation is positioning gold and other precious metals to become active components within the evolving digital financial system.

marsbitHá 1h

Matrixdock Featured Again in SBMA’s 《Crucible》: Discussing How Tokenisation Enhances Efficiency in the Precious Metals Market