3B Small Model's Programming Scores Rival Opus 4.5, Mysterious Model Sparks Heated Discussion, Turns Out to be Domestic

marsbitPubblicato 2026-06-18Pubblicato ultima volta 2026-06-18

Introduzione

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a 3B small model has gained popularity on X because in some difficulty-verifiable reasoning tasks (like programming), it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, while its size is far smaller than these models.

This model is named VibeThinker-3B, a dense reasoning model with 3 billion parameters, aiming to explore how far verifiable reasoning capability can be pushed under strictly small model scale constraints.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is also a domestic model, coming from the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

Therefore, it performs exceptionally well in various benchmark tests. It scored 94.3 on the AIME26 test, 89.3 on the HMMT25 test, 80.2 on the LiveCodeBench v6 test (Pass@1), and achieved a 96.1% pass rate in the latest unpublished weekly and biweekly LeetCode contests between April 25 and May 31, 2026.

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and undergoes post-training using an upgraded Spectrum-to-Signal process. This process strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT), extends MGPO-style reinforcement learning to multiple verifiable domains, preserves complete long-context reasoning trajectories, and consolidates various capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Overall training pipeline of VibeThinker-3B

Spectrum-to-Signal pipeline.

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.

The specific training pipeline is as follows:

Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
Instruct RL. The final stage improves the controllability for user-facing prompts. For format-sensitive and open-ended instructional data, rule-based verifiers and rubric-based reward models are employed.

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized key points disclosed in the VibeThinker-3B technical report, including the following:

If you are interested in this content, you can delve into their technical report. Currently, the model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge. This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought — an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian

Domande pertinenti

QWhat is the name and key characteristic of the small AI model discussed in the article?

AThe model is called VibeThinker-3B. Its key characteristic is that despite being a small model with only 3 billion parameters, it achieves reasoning performance on verifiable tasks like programming that is comparable to much larger frontier models.

QWhich company or team developed the VibeThinker-3B model?

AThe VibeThinker-3B model was developed by the Sina Weibo (microblog) team, making it a domestic Chinese model.

QWhat is the core hypothesis proposed by the creators of VibeThinker-3B regarding model capabilities?

AThe core hypothesis is the 'Parameter-Compression Coverage Hypothesis'. It suggests that different capabilities depend on model parameters in distinct ways. Verifiable reasoning (multi-step reasoning, constraint satisfaction) is highly compressible and parameter-dense. In contrast, open-domain knowledge and understanding rely more on large-scale parameters for broad factual coverage.

QOn which specific benchmark tasks did VibeThinker-3B demonstrate exceptional performance?

AVibeThinker-3B demonstrated exceptional performance on verifiable reasoning benchmarks such as AIME26 (97.1 with CLR), HMMT25 (95.4 with CLR), LiveCodeBench v6 (80.2 Pass@1), and recent private LeetCode contests (96.1% pass rate).

QWhat are the main limitations or scope of application for the VibeThinker-3B model as stated in the article?

AThe model's applicability is limited. It excels in domains with clear verification signals (math, programming, STEM) but does not perform well in areas requiring general world knowledge, open-domain dialogue, or understanding of long-tail scenarios, as these rely on broader parametric coverage.

Letture associate

Football Draw Harvests Whales: Extreme Profit-Loss Divergence on Polymarket's World Cup

A bettor known as "fishalive" made a stunning profit of nearly $9 million on the Polymarket prediction platform by correctly wagering against favorites during the 2026 FIFA World Cup group stage. The account, registered just before the tournament, risked roughly $400,000 on two contracts for a Spain vs. Cape Verde match: one that Spain would *not* win, and another on a Cape Verde +2.5 goal handicap. The resulting 0-0 draw triggered both payouts. This single event, with a total market volume of $64 million, highlighted extreme profit-and-loss divergence. Other traders, like "betoor619" and "leeeeroyjenkins," lost millions by betting heavily on favorites Spain and Belgium to win outright—contracts that become worthless in a draw. The article explains that while markets heavily favored strong teams, the "team to win" contracts are binary and do not account for the common outcome of a draw. This creates high-risk, low-reward scenarios for favorite backers, while asymmetric profits flow to those betting on underdogs or against outright wins. The transparency of Polymarket's on-chain ledger publicly documents these massive wins and losses, driving mainstream media coverage. As the tournament progresses, the author suggests traders may shift towards hedging strategies that account for draws. The piece also notes growing regulatory scrutiny in the US and Europe, questioning whether such large-scale, anonymous sports prediction markets should be regulated as gambling or financial derivatives.

Foresight News1 min fa

Football Draw Harvests Whales: Extreme Profit-Loss Divergence on Polymarket's World Cup

Foresight News1 min fa

ChatGPT Loses Half Its Market: From Monopoly to Shared Market in Three and a Half Years

In a landmark shift three and a half years after its debut, ChatGPT's global market share in the AI assistant market has fallen below 50% for the first time, dropping to 46.4% as of May 2026. This signals the end of its initial dominance, with the market now diversifying among competitors like Gemini (27.7%) and Claude (10.3%). The report from Sensor Tower indicates the AI assistant landscape has matured from a phase of awe and experimentation into one of product comparison, ecosystem integration, and monetization. Users are increasingly pragmatic, readily switching between assistants based on specific use cases, brand trust, and value propositions. The industry is moving past the "free lunch" era, with users demonstrating a willingness to pay for premium features, driving significant in-app expenditure. Major players are adopting varied monetization strategies: Claude boasts a high subscription conversion rate, while ChatGPT is increasingly testing ads and shopping integrations to complement its subscription revenue. However, this growth comes with immense costs, as exemplified by OpenAI's soaring cash burn for model training and infrastructure. While ChatGPT remains the largest single player, its declining share symbolizes a broader normalization of AI. The technology is no longer a novelty but an integral, scrutinized part of daily digital life, judged on practical utility, price, and seamless integration. The battle has shifted from proving AI's potential to competing in a crowded field where no single product holds a permanent monopoly.

marsbit12 min fa

ChatGPT Loses Half Its Market: From Monopoly to Shared Market in Three and a Half Years

marsbit12 min fa

Prediction Markets Turn Bearish As Kalshi Traders Price 69% Odds Of Bitcoin Dropping To $50,000 First

Prediction market traders on Kalshi are pricing in a 69% probability that Bitcoin will fall to $50,000 before it reaches $100,000. This reflects a bearish sentiment snapshot among participants on the platform, contrasting with more optimistic cycle-bottom calls from some investors. The odds are not a guaranteed forecast but a live market gauge that can shift quickly with price action and trader positioning. The setup captures a key market debate: whether Bitcoin faces another significant downside leg or is poised for a major bullish breakout driven by institutional demand. The article cautions against over-interpreting the number, noting prediction markets are best used as a sentiment indicator alongside factors like ETF flows and macroeconomic policy.

bitcoinist39 min fa

Prediction Markets Turn Bearish As Kalshi Traders Price 69% Odds Of Bitcoin Dropping To $50,000 First

bitcoinist39 min fa

World Cup Kicks Off, Spotlight on 'Big Wins' and 'Major Losses' in Prediction Markets

"The World Cup has begun, showcasing dramatic gains and losses in prediction markets. With the 2026 FIFA World Cup expected to generate up to $10 billion in consumer betting volume, platforms like Polymarket have seen explosive activity. The article highlights several high-profile traders. One user (@mintblade) achieved a 100% win rate across four bets, earning $9.24 million in a single day, including a $7.34 million profit from correctly predicting Iran would not win against New Zealand. Another new wallet (@Fishalive) turned a $4.22 million bet on Spain not winning against Cape Verde into $13.28 million, a 1000%+ return after a shocking draw. Other successful traders include @LEEEROYJENKINS with $5.2 million in profits and @endlessFate, who turned early losses into over $7.85 million in weekly gains. However, significant losses also occurred. User @betoor619 lost nearly $1 million after a heavily favored Spain failed to defeat Cape Verde. Another user, @weatherman12, lost over $1.81 million betting against Argentina, which won 3-0 with a Lionel Messi hat-trick. The report serves as a reminder of the high volatility and risk in sports prediction markets, where unexpected outcomes are common."

Odaily星球日报51 min fa

World Cup Kicks Off, Spotlight on 'Big Wins' and 'Major Losses' in Prediction Markets

Odaily星球日报51 min fa

Musk's Wealth Surpasses Bitcoin Market Cap: The Wealth Game Behind SpaceX's Soaring Value

Musk's net worth, estimated at $1.32 trillion, has surpassed Bitcoin's market capitalization of $1.29 trillion, driven by SpaceX's stock surging over 50% since its IPO to a $2.7 trillion valuation. This shift highlights changing speculative capital flows, moving from the crypto market—where Bitcoin has retreated over 50% from its 2025 peak—toward high-growth tech stocks like SpaceX. Retail investors, particularly from South Korea, have fueled the rally, making SpaceX one of the most actively traded stocks and its leveraged ETFs record-breaking in volume. However, despite the enthusiasm, SpaceX reported a $49.4 billion net loss in 2025 and a $42.7 billion loss in Q1 2026, reflecting heavy investments in Starlink, AI, and launch infrastructure. Its current valuation largely hinges on Musk's ambitious projection of $1 trillion annual revenue by 2030. The comparison underscores that the market's biggest speculative bet is no longer a cryptocurrency but a rocket company, though questions remain about how long this growth narrative can sustain given SpaceX's significant ongoing losses.

marsbit1 h fa

Musk's Wealth Surpasses Bitcoin Market Cap: The Wealth Game Behind SpaceX's Soaring Value

marsbit1 h fa

Trading

Spot

Futures