3B Small Model's Programming Scores Rival Opus 4.5, Mysterious Model Sparks Heated Discussion, Turns Out to be Domestic

marsbitPublished on 2026-06-18Last updated on 2026-06-18

Abstract

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a 3B small model has gained popularity on X because in some difficulty-verifiable reasoning tasks (like programming), it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, while its size is far smaller than these models.

This model is named VibeThinker-3B, a dense reasoning model with 3 billion parameters, aiming to explore how far verifiable reasoning capability can be pushed under strictly small model scale constraints.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is also a domestic model, coming from the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

Therefore, it performs exceptionally well in various benchmark tests. It scored 94.3 on the AIME26 test, 89.3 on the HMMT25 test, 80.2 on the LiveCodeBench v6 test (Pass@1), and achieved a 96.1% pass rate in the latest unpublished weekly and biweekly LeetCode contests between April 25 and May 31, 2026.

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and undergoes post-training using an upgraded Spectrum-to-Signal process. This process strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT), extends MGPO-style reinforcement learning to multiple verifiable domains, preserves complete long-context reasoning trajectories, and consolidates various capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Overall training pipeline of VibeThinker-3B

Spectrum-to-Signal pipeline.

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.

The specific training pipeline is as follows:

  • Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
  • Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
  • Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
  • Instruct RL. The final stage improves the controllability for user-facing prompts. For format-sensitive and open-ended instructional data, rule-based verifiers and rubric-based reward models are employed.

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized key points disclosed in the VibeThinker-3B technical report, including the following:

If you are interested in this content, you can delve into their technical report. Currently, the model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge. This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought — an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian

Related Questions

QWhat is the name and key characteristic of the small AI model discussed in the article?

AThe model is called VibeThinker-3B. Its key characteristic is that despite being a small model with only 3 billion parameters, it achieves reasoning performance on verifiable tasks like programming that is comparable to much larger frontier models.

QWhich company or team developed the VibeThinker-3B model?

AThe VibeThinker-3B model was developed by the Sina Weibo (microblog) team, making it a domestic Chinese model.

QWhat is the core hypothesis proposed by the creators of VibeThinker-3B regarding model capabilities?

AThe core hypothesis is the 'Parameter-Compression Coverage Hypothesis'. It suggests that different capabilities depend on model parameters in distinct ways. Verifiable reasoning (multi-step reasoning, constraint satisfaction) is highly compressible and parameter-dense. In contrast, open-domain knowledge and understanding rely more on large-scale parameters for broad factual coverage.

QOn which specific benchmark tasks did VibeThinker-3B demonstrate exceptional performance?

AVibeThinker-3B demonstrated exceptional performance on verifiable reasoning benchmarks such as AIME26 (97.1 with CLR), HMMT25 (95.4 with CLR), LiveCodeBench v6 (80.2 Pass@1), and recent private LeetCode contests (96.1% pass rate).

QWhat are the main limitations or scope of application for the VibeThinker-3B model as stated in the article?

AThe model's applicability is limited. It excels in domains with clear verification signals (math, programming, STEM) but does not perform well in areas requiring general world knowledge, open-domain dialogue, or understanding of long-tail scenarios, as these rely on broader parametric coverage.

Related Reads

Full Debut Q&A! Fed Chair Wash: Firmly Adhering to 2% Inflation Target, Establishing Five Special Task Forces, Personally Did Not Submit Dot Plot

Federal Reserve Chair Kevin Warsh delivered his first FOMC press conference, maintaining the federal funds rate at 3.5%-3.75% and emphasizing the Committee's unanimous and explicit commitment to achieving its 2% inflation target. Key announcements included significant changes to Fed communication and operations. The policy statement was significantly shortened and, notably, forward guidance was removed. Chair Warsh broke from precedent by declining to submit his own economic forecasts and "dot plot." He announced the immediate formation of five special working groups focusing on: Fed communication, the balance sheet, data sources, productivity and employment (including AI's impact), and the inflation framework. These groups, which will include external experts, are tasked with recommending improvements by year-end. One key group will review the Fed's $6.7 trillion balance sheet to assess the roles of interest rates versus balance sheet tools in monetary policy. Warsh characterized the current restrictive stance of policy as "uneven," noting its effect on housing but questioning its impact on financial markets where conditions appear less restrictive. He expressed a desire to move away from a "Fed-speak" driven market, arguing that markets should react to economic data rather than Fed commentary to provide better informational signals. On inflation, he stated there is no need to reconsider the 2% target until the Fed re-establishes its commitment and capability to achieve it. Economic projections (SEP) from other officials showed a split on the rate outlook for 2024, with half expecting at least one hike and half forecasting unchanged or lower rates. The median projection saw the federal funds rate at 3.8% by year-end 2024. Following the announcements, risk assets sold off sharply, Treasury yields rose, and the dollar strengthened.

marsbit11m ago

Full Debut Q&A! Fed Chair Wash: Firmly Adhering to 2% Inflation Target, Establishing Five Special Task Forces, Personally Did Not Submit Dot Plot

marsbit11m ago

Full First Q&A! Fed Chair Warsh: Sticks to 2% Inflation Target, Establishes Five Special Working Groups, Personally Did Not Submit Dot Plot

The Federal Reserve, under new Chair Kevin Warsh, held its first FOMC meeting, maintaining the federal funds rate target range at 3.5% to 3.75%. The central bank issued a significantly shortened policy statement, explicitly removing forward guidance. Chair Warsh delivered a strong, unified commitment to achieving the 2% inflation target, stating the FOMC has the "capability and the commitment" to restore price stability and sees no need to review the target itself at this time. Warsh announced the immediate formation of five special working groups to examine and propose improvements in key areas by year-end: Fed communication, the balance sheet (including a review of the $6.7 trillion portfolio and its role in policy), data sources and methodology, productivity and employment (including AI's impact), and the inflation framework. In a break from tradition, Chair Warsh did not submit his own economic projections or "dot plot." The submitted Summary of Economic Projections (SEP) showed a split among other officials: half anticipate at least one rate hike this year, while half expect rates to remain steady or fall. The median projection sees the federal funds rate at 3.8% by year-end 2026. Warsh characterized the current policy stance as "uneven," noting restrictive effects in sectors like housing but less so in financial markets. He emphasized a desire to move away from a market dynamic overly focused on Fed signaling, advocating for markets to react more to economic data. On AI, he called it potentially the most significant economic change in his adult life, driving clear demand but with uncertain timing and scale on the supply side, creating a "race" between the two.

链捕手14m ago

Full First Q&A! Fed Chair Warsh: Sticks to 2% Inflation Target, Establishes Five Special Working Groups, Personally Did Not Submit Dot Plot

链捕手14m ago

The DeepSeek Financing Story

DeepSeek's financing round, totaling approximately 3 billion USD, concluded recently, revealing details about the process and key investors. The round was initiated around April with strict initial terms: a minimum commitment of 5 billion RMB, no syndication, and a pure RMB structure. These were later relaxed, with the minimum ticket size reduced to 1.5 billion RMB. A pivotal four-hour online investor meeting in mid-May served as the primary interaction for many backers with DeepSeek's founder, Liang Wenfeng. Despite not being a naturally eloquent speaker, Liang's philosophy deeply resonated. He consistently emphasized the company's singular focus on AGI (Artificial General Intelligence), the principle of "less is more," extreme caution in spending, and the paramount importance of team stability. His notable quotes included describing the team as "ordinary people doing extraordinary things" and stating that "AGI is a big enough thing; everything else is just process." The final investor list featured 10 entities, but underlying fund structures indicate participation from nearly 100 institutions and individuals. Notable lead investors include Monolith Capital (increasing its commitment from 1.5 to 3 billion RMB), Zhenxingu Capital, IDG Capital, and state-affiliated investors like Guozhitou. Conspicuously absent were major firms like Sequoia China and Hillhouse Capital, despite earlier speculation about their involvement. A core condition set by Liang Wenfeng for all investors, whether corporate or venture capital, was a strict prohibition against poaching DeepSeek employees or encouraging them to leave to start ventures. The financing process highlighted DeepSeek's unexpected openness to external capital, surprising many in the investment community. The company's low-profile nature, combined with its ambitious AGI vision and principled approach, fostered a sense of reverence among participating investors, many of whom were reluctant to discuss the deal publicly, preferring to maintain its discreet and purposeful ethos.

链捕手21m ago

The DeepSeek Financing Story

链捕手21m ago

The DeepSeek Fundraising Story

"The DeepSeek Funding Story: Insights from the $2.15 Billion Round" This article details behind-the-scenes narratives from DeepSeek's recent massive funding round. Key highlights include the legendary four-hour online investor meeting where CEO Liang Wenfeng, despite not being a charismatic speaker, impressed attendees with his focus on AGI and team stability. He emphasized the philosophy of "ordinary people doing extraordinary things" and a steadfast commitment to solely advancing intelligence. The fundraising process, initiated in April, saw initial demands for a minimum investment of 5 billion RMB, no syndication, and a pure RMB structure. These terms were later adjusted to a 1.5 billion RMB minimum to accommodate more investors. A notable absence was the lack of participation from major VC firms Sequoia China and Hillhouse Capital, despite early rumors, making IDG the only established VC in the final lineup. The investor list, while showing 10 entities, actually involved nearly 100 underlying institutions and individuals upon closer examination. Significant participants included Monolith Capital, which doubled its commitment to 3 billion RMB, and Zhenxingu Capital, an unexpected entrant. Liang Wenfeng's paramount condition for all investors was a strict agreement not to poach DeepSeek employees. The article reflects on DeepSeek's unexpected openness to funding and the mix of strategies—synergy, insight, brand alignment, and persistence—that secured investors a stake. The overarching sentiment among participants is one of pride and a shared belief in DeepSeek's potential to become a landmark Chinese company, driven by a profound sense of purpose in the AGI race.

marsbit23m ago

The DeepSeek Fundraising Story

marsbit23m ago

DAT Companies Take on Side Businesses

The article details the strategic shifts of Digital Asset Treasury (DAT) companies amid a prolonged market downturn. Initially popularized by firms like MicroStrategy, the DAT model—where companies hold cryptocurrencies on their balance sheets—has faced significant pressure as crypto ETFs eroded their investment thesis and bear markets strained finances. Key players are adapting in several ways. Some, like ETHzilla, have abandoned the model entirely. Others are pivoting towards becoming active, revenue-generating participants in the crypto ecosystem rather than passive holders. This evolution is taking two main paths: 1. **Transformation into Institutional Crypto Asset Managers:** Companies like SharpLink Gaming and GameSquare are leveraging their holdings to generate yield, moving beyond simple storage. They employ strategies like 100% ETH staking or using machine learning to optimize DeFi yields, positioning themselves as platforms offering institutional-grade crypto yield products. 2. **Becoming Blockchain Infrastructure Operators:** This is prominent in the Solana ecosystem. Firms like DeFi Development and SOL Strategies are acquiring validator businesses, issuing liquid staking tokens (e.g., dfdvSOL, fwdSOL), and integrating them into DeFi protocols. They generate fee-based revenue from these operations, building network effects. The article notes that successful转型 hinges on building operational expertise and defensible business models around crypto assets, such as technical advantages or deep ecosystem integration. However, these new paths carry risks like smart contract vulnerabilities or ecosystem dependence. Ultimately, this collective shift signals a maturation phase for the DAT movement. It highlights that in crypto, sustainable value is increasingly tied to active participation, cash flow generation, and providing real utility, rather than speculative capital allocation alone.

Foresight News26m ago

DAT Companies Take on Side Businesses

Foresight News26m ago

Trading

Spot
Futures
活动图片