3B Small Model's Programming Scores Rival Opus 4.5, Mysterious Model Sparks Heated Discussion, Turns Out to be Domestic

marsbitPublicado a 2026-06-18Actualizado a 2026-06-18

Resumen

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a 3B small model has gained popularity on X because in some difficulty-verifiable reasoning tasks (like programming), it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, while its size is far smaller than these models.

This model is named VibeThinker-3B, a dense reasoning model with 3 billion parameters, aiming to explore how far verifiable reasoning capability can be pushed under strictly small model scale constraints.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is also a domestic model, coming from the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

Therefore, it performs exceptionally well in various benchmark tests. It scored 94.3 on the AIME26 test, 89.3 on the HMMT25 test, 80.2 on the LiveCodeBench v6 test (Pass@1), and achieved a 96.1% pass rate in the latest unpublished weekly and biweekly LeetCode contests between April 25 and May 31, 2026.

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and undergoes post-training using an upgraded Spectrum-to-Signal process. This process strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT), extends MGPO-style reinforcement learning to multiple verifiable domains, preserves complete long-context reasoning trajectories, and consolidates various capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Overall training pipeline of VibeThinker-3B

Spectrum-to-Signal pipeline.

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.

The specific training pipeline is as follows:

  • Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
  • Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
  • Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
  • Instruct RL. The final stage improves the controllability for user-facing prompts. For format-sensitive and open-ended instructional data, rule-based verifiers and rubric-based reward models are employed.

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized key points disclosed in the VibeThinker-3B technical report, including the following:

If you are interested in this content, you can delve into their technical report. Currently, the model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge. This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought — an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian

Preguntas relacionadas

QWhat is the name and key characteristic of the small AI model discussed in the article?

AThe model is called VibeThinker-3B. Its key characteristic is that despite being a small model with only 3 billion parameters, it achieves reasoning performance on verifiable tasks like programming that is comparable to much larger frontier models.

QWhich company or team developed the VibeThinker-3B model?

AThe VibeThinker-3B model was developed by the Sina Weibo (microblog) team, making it a domestic Chinese model.

QWhat is the core hypothesis proposed by the creators of VibeThinker-3B regarding model capabilities?

AThe core hypothesis is the 'Parameter-Compression Coverage Hypothesis'. It suggests that different capabilities depend on model parameters in distinct ways. Verifiable reasoning (multi-step reasoning, constraint satisfaction) is highly compressible and parameter-dense. In contrast, open-domain knowledge and understanding rely more on large-scale parameters for broad factual coverage.

QOn which specific benchmark tasks did VibeThinker-3B demonstrate exceptional performance?

AVibeThinker-3B demonstrated exceptional performance on verifiable reasoning benchmarks such as AIME26 (97.1 with CLR), HMMT25 (95.4 with CLR), LiveCodeBench v6 (80.2 Pass@1), and recent private LeetCode contests (96.1% pass rate).

QWhat are the main limitations or scope of application for the VibeThinker-3B model as stated in the article?

AThe model's applicability is limited. It excels in domains with clear verification signals (math, programming, STEM) but does not perform well in areas requiring general world knowledge, open-domain dialogue, or understanding of long-tail scenarios, as these rely on broader parametric coverage.

Lecturas Relacionadas

Gate Research Institute: Analysis of Chart Patterns and Breakout Trading Strategies

Gate Research Institute: Chart Pattern Analysis and Breakout Trading Strategies Chart patterns are crucial tools in technical analysis for observing market supply and demand shifts, trend continuations, and reversals. This analysis involves a comprehensive evaluation of trend, volume, support/resistance, time cycles, and breakout validity, not just rote pattern recognition. Patterns are broadly categorized into reversal patterns (e.g., Double Tops/Bottoms, Head and Shoulders) and continuation patterns (e.g., Flags, Triangles, Rectangles). An effective breakout, key for trading, requires clear support/resistance, prolonged consolidation, a prevailing trend backdrop, and volume confirmation. However, breakouts are not guaranteed, as false breakouts are common. Risk must be managed through position sizing, stop-loss orders, pullback confirmations, and profit-taking in stages. Key pattern types discussed include: * **Rectangle Patterns:** Indicate market indecision within parallel support and resistance, with breakouts projecting a move equal to the pattern's width. * **Flag & Pennant Patterns:** Short-term continuation patterns following sharp price moves ("flagpoles"). * **Triangle Patterns:** Symmetrical, Ascending (bullish bias), and Descending (bearish bias) triangles, representing consolidation before a directional move. * **Head and Shoulders Patterns:** Major reversal patterns signaling trend exhaustion. The article details breakout trading strategies, defining valid breakouts by price closing beyond a key level with increased volume and minimal immediate re-entry into the prior range. It contrasts range trading with breakout trading and outlines entry methods (immediate entry, pullback entry, scaling in), stop-loss placement (based on pattern failure), and profit-taking techniques (target-based, structure-based, trend-following). It further classifies breakout outcomes: 1. **Valid Breakouts:** Strong, sustained moves in the breakout direction. 2. **Pullback Breakouts:** Price breaks out, retests the breakout level as support/resistance, then resumes the trend—offering a lower-risk entry. 3. **False Breakouts:** Price briefly breaches a level but quickly reverses back into the prior range, a common risk managed by strict stop-losses. Key validation tools for breakouts include volume analysis, the principle of support/resistance role reversal, and momentum indicators like ATR, Moving Averages, Bollinger Bands, and RSI. In conclusion, while chart patterns and breakout analysis provide a structured framework, their effectiveness relies on multiple confirming factors—trend context, volume, and proper risk management. They should be integrated into a broader trading system rather than used as standalone signals.

marsbitHace 18 min(s)

Gate Research Institute: Analysis of Chart Patterns and Breakout Trading Strategies

marsbitHace 18 min(s)

Joseph Chalom: Ethereum is Becoming the "Settlement Layer of Trust" for Global Finance

In a speech titled "The Industrialization of Trust," Sharplink CEO Joseph Chalom (former BlackRock digital assets head) discussed the future transformation of global finance. Drawing from 20 years at BlackRock, where he led the launch of Bitcoin/ETH ETFs and tokenized funds, Chalom highlighted the immense hidden costs of establishing trust in traditional finance—estimated at over $9.3 trillion annually in the US alone due to fragmented systems, multi-day settlements, and countless reconciliations. He argued that Ethereum is emerging as the global financial "settlement layer for trust," with its robust, decentralized infrastructure securing over $300 billion in on-chain assets and most stablecoins and tokenized assets. The future, he stated, will be driven by three accelerating pillars: stablecoins (evolving beyond crypto gateways to become efficient cross-border payment rails), tokenized assets (enabling 24/7 trading and reshaping capital markets), and DeFi (providing automated, accessible financial services). A potential game-changer, Chalom added, is the fourth pillar: "Agentic Finance," where AI agents autonomously execute programmable financial transactions via smart contracts and stablecoins. He envisions individuals soon having AI-powered "CFOs in their pockets" to optimize idle capital and manage tokenized portfolios. This shift, facilitated by Ethereum's trustless settlement, could multiply on-chain transaction volume 1000x within a year, moving finance toward a seamless, digitized future.

marsbitHace 19 min(s)

Joseph Chalom: Ethereum is Becoming the "Settlement Layer of Trust" for Global Finance

marsbitHace 19 min(s)

STRC Severely Unpegged, What Risks Is the Market Pricing In?

The article analyzes the recent significant de-pegging of Strategy's perpetual preferred stock, STRC, whose price fell to approximately $89, far below its $100 face value. This discount has pushed its simple yield to around 12.9%, creating a paradox. The stock was designed as a high-yield instrument trading near par, and Strategy maintains an 11.5% annual dividend, even recently switching to semi-monthly payments to support the price. The author explores several reasons why the high yield hasn't attracted enough buying pressure to restore the par value. A key factor is potential reverse deleveraging from carry trades, where leveraged investors may be forced to sell due to margin calls as the price falls, creating a self-reinforcing downward spiral. Additionally, the tokenization and integration of STRC into DeFi protocols (like Apyx, Saturn, Pendle) have introduced faster, more transparent, and potentially more volatile price adjustment mechanisms through leverage and yield-splitting products. The emergence of a competing product, Strive's SATA, offering a 13% yield with daily dividends, has also changed the yield benchmark, challenging STRC's unique high-yield narrative. Furthermore, the market is questioning the distinction between Strategy's substantial Bitcoin reserves, which provide long-term balance sheet coverage, and the certainty of stable near-term cash flow for dividends. Ultimately, the price dip represents a stress test for this type of BTC-backed, high-yield financing tool. The future path of STRC depends on whether Strategy acts to reinforce the $100 peg (e.g., by adjusting dividends), whether DeFi-related leverage unwinds further, and how investors ultimately price the risks of leverage, competition, and cash flow uncertainty against the offered yield.

marsbitHace 30 min(s)

STRC Severely Unpegged, What Risks Is the Market Pricing In?

marsbitHace 30 min(s)

LIT Token Hits Six-Month High: How Long Can the Buyback Flywheel Keep Burning Fuel?

The LIT token of decentralized perpetual exchange Lighter surged to a six-month high above $1.90 on June 18th, with a market cap of $425 million. After a price correction earlier this year, the recent rebound is attributed to its core "buyback flywheel" mechanism. All protocol fee revenue is used for programmatic, hourly market buybacks of LIT. Since its TGE in December 2025, approximately 15 million LIT (6% of circulating supply) has been repurchased for around $21 million. Additional price support comes from the LLP (Lighter Liquidity Pool), where providers must stake LIT worth 10% of their deposited USDC, locking significant token supply. However, challenges persist. Trading volume has declined amidst a sluggish market, with total volume at $1.68 trillion, significantly lower than leading competitor Hyperliquid's $4.37 trillion. While Lighter focuses on perpetual contracts, RWA, and Pre-IPO markets, Hyperliquid has expanded into prediction markets and boasts a U.S. spot ETF, attracting institutional investment and influencer endorsements like from Arthur Hayes. In contrast, LIT currently lacks similar high-profile backing. With 75% of LIT's total 1 billion supply still locked (team and investor tokens begin a 3-year linear unlock in December 2026), there is no immediate unlock selling pressure. The token's future performance hinges on sustaining trading volume growth, successful product iteration, and executing its transparent buyback strategy against a dominant competitor.

Foresight NewsHace 49 min(s)

LIT Token Hits Six-Month High: How Long Can the Buyback Flywheel Keep Burning Fuel?

Foresight NewsHace 49 min(s)

Trading

Spot
Futuros
活动图片