3B Small Model's Programming Scores Rival Opus 4.5, Mysterious Model Sparks Heated Discussion, Turns Out to be Domestic

marsbitPublicado a 2026-06-18Actualizado a 2026-06-18

Resumen

A 3B parameter dense reasoning model named VibeThinker-3B has gained significant attention for achieving performance comparable to leading models like Gemini 3 Pro, GPT-5 high, and Claude Opus 4.5 in verifiable reasoning tasks such as programming, mathematics, and STEM problem-solving, despite its significantly smaller size. Developed by Sina Weibo's team, the model is built upon Qwen2.5-Coder-3B. Its training employs an upgraded Spectrum-to-Signal pipeline, featuring a curriculum-based two-stage supervised fine-tuning (SFT), multi-domain reinforcement learning (RL) inspired by MGPO, offline self-distillation, and instruction RL to enhance controllability. A key innovation is the Claim-Level Reliability (CLR) assessment, a test-time scaling strategy that further boosts performance on math benchmarks. The model excels in specific, verifiable domains, scoring highly on tests like AIME26 (94.3/97.1 with CLR) and LiveCodeBench v6 (80.2 Pass@1). However, it performs less impressively in areas requiring broad general knowledge. The authors propose a "parameter compression coverage hypothesis," suggesting that verifiable reasoning abilities—reliant on multi-step logic and feedback—are highly compressible, while open-domain knowledge depends more on large-scale parameters. VibeThinker-3B demonstrates that small models, when specialized for tasks with clear verification signals, can reach frontier performance, offering a complementary research path to scaling model size. The model ...

In recent days, a 3B small model has gained popularity on X because in some difficulty-verifiable reasoning tasks (like programming), it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, while its size is far smaller than these models.

This model is named VibeThinker-3B, a dense reasoning model with 3 billion parameters, aiming to explore how far verifiable reasoning capability can be pushed under strictly small model scale constraints.

After the model's release, many were amazed by its results and expressed a desire to try it out.

Notably, it is also a domestic model, coming from the Sina Weibo team.

The technical report shows that this model is designed specifically for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction execution with clear constraints.

Therefore, it performs exceptionally well in various benchmark tests. It scored 94.3 on the AIME26 test, 89.3 on the HMMT25 test, 80.2 on the LiveCodeBench v6 test (Pass@1), and achieved a 96.1% pass rate in the latest unpublished weekly and biweekly LeetCode contests between April 25 and May 31, 2026.

How was this model trained? The technical report reveals some details.

First, it is built upon Qwen2.5-Coder-3B and undergoes post-training using an upgraded Spectrum-to-Signal process. This process strengthens data synthesis, quality filtering, and curriculum learning in Supervised Fine-Tuning (SFT), extends MGPO-style reinforcement learning to multiple verifiable domains, preserves complete long-context reasoning trajectories, and consolidates various capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Overall training pipeline of VibeThinker-3B

Spectrum-to-Signal pipeline.

Furthermore, VibeThinker-3B introduces Claim-Level Reliability (CLR) assessment, a test-time scaling strategy for answer-verifiable reasoning. CLR further improves performance on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.

The specific training pipeline is as follows:

  • Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage in mathematics, programming, STEM reasoning, general conversation, and instruction following. The second stage shifts to more difficult, broader-scope reasoning samples. Diversity-Exploring Distillation is used to preserve multiple valid solution paths.
  • Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to preserve complete long-horizon reasoning trajectories.
  • Offline self-distillation. High-quality trajectories are filtered and distilled from the mathematical, programming, and STEM RL checkpoints, ultimately forming a unified student model. Learning Potential Scoring is used to prioritize trajectories that are correct but not yet well imitated by the student.
  • Instruct RL. The final stage improves the controllability for user-facing prompts. For format-sensitive and open-ended instructional data, rule-based verifiers and rubric-based reward models are employed.

In a recent post, well-known AI researcher and blogger Sebastian Raschka systematically summarized key points disclosed in the VibeThinker-3B technical report, including the following:

If you are interested in this content, you can delve into their technical report. Currently, the model is also publicly available for download.

Report Title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report Link: https://arxiv.org/pdf/2606.16140

HuggingFace Link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the model's applicable scope has clear limitations, as it does not perform well in domains requiring general knowledge.

The developers also explicitly point this out and propose the "Parameter Compression Coverage Hypothesis": Different capabilities rely on model parameters in drastically different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space structure is clear enough and feedback signals are sufficiently reliable, compact models can also possess near-state-of-the-art reasoning capabilities. In contrast, open-domain knowledge, general conversation, and long-tail scenario understanding rely more on large-scale parameters to extensively cover facts, concepts, and world knowledge. This hypothesis is very insightful. VentureBeat wrote in its report: "It reveals a partial decoupling between reasoning capability and factual knowledge, and that the former can be compressed more efficiently than previously thought — an insight that has profound implications for how the industry thinks about model design, deployment costs, and the accessibility of advanced AI capabilities."

The authors state that their goal is not to create a small model to replace large-scale models, but to examine the true boundaries of small models along specific capability dimensions. With VibeThinker-3B, they hope to demonstrate that small models should not merely be seen as a compromise to reduce deployment costs. In capability domains with clear feedback and verification mechanisms, small language models are revealing a promising research path, potentially achieving frontier-level performance and forming a fundamentally complementary relationship with the traditional paradigm of parameter scaling.

Currently, the model still faces some skepticism within the community. If you are interested in this model, you might want to try it out for yourself.

Reference Links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Zhang Qian

Preguntas relacionadas

QWhat is the name and key characteristic of the small AI model discussed in the article?

AThe model is called VibeThinker-3B. Its key characteristic is that despite being a small model with only 3 billion parameters, it achieves reasoning performance on verifiable tasks like programming that is comparable to much larger frontier models.

QWhich company or team developed the VibeThinker-3B model?

AThe VibeThinker-3B model was developed by the Sina Weibo (microblog) team, making it a domestic Chinese model.

QWhat is the core hypothesis proposed by the creators of VibeThinker-3B regarding model capabilities?

AThe core hypothesis is the 'Parameter-Compression Coverage Hypothesis'. It suggests that different capabilities depend on model parameters in distinct ways. Verifiable reasoning (multi-step reasoning, constraint satisfaction) is highly compressible and parameter-dense. In contrast, open-domain knowledge and understanding rely more on large-scale parameters for broad factual coverage.

QOn which specific benchmark tasks did VibeThinker-3B demonstrate exceptional performance?

AVibeThinker-3B demonstrated exceptional performance on verifiable reasoning benchmarks such as AIME26 (97.1 with CLR), HMMT25 (95.4 with CLR), LiveCodeBench v6 (80.2 Pass@1), and recent private LeetCode contests (96.1% pass rate).

QWhat are the main limitations or scope of application for the VibeThinker-3B model as stated in the article?

AThe model's applicability is limited. It excels in domains with clear verification signals (math, programming, STEM) but does not perform well in areas requiring general world knowledge, open-domain dialogue, or understanding of long-tail scenarios, as these rely on broader parametric coverage.

Lecturas Relacionadas

Don't Just Focus on Layoffs, The New Structure of the Ethereum Foundation is More Worthy of Appreciation

The Ethereum Foundation (EF) has undergone a significant organizational restructuring, with the most notable change being a strategic refocusing of its priorities rather than just a 20% staff reduction (approximately 54 people). The new structure clearly prioritizes the Protocol and Access layers, which now comprise the largest teams (57 and 34 people, respectively). This signals EF's intent to concentrate its core resources on fundamental, hard-to-outsource aspects of Ethereum: protocol evolution, security, privacy, client development, and the foundational access layer. Key areas within the Protocol layer, led by an architecture group including Vitalik Buterin and Justin Drake, receive heightened emphasis. These include post-quantum security, zkEVM, formal verification, and long-term roadmap development ("Strawmap"). This reflects a shift towards tackling complex, interdependent challenges like scalability, privacy, and future-proofing the protocol, potentially moving from a pure "redundant security" multi-client model towards more specialized clients aided by AI-assisted formal verification. Financially, EF's budget is being reduced by approximately 40%. The goal is to transition from spending about 15% of its remaining funds annually to a more sustainable 5% rate, akin to a long-term endowment, ensuring its longevity. Concurrently, the restructuring involves pushing certain responsibilities—such as application development, adoption, and ecosystem coordination—to external organizations like EthLabs, the Ethereum Apps Guild, and others. This "multi-node" model aims to increase ecosystem resilience by decentralizing functions beyond the EF, though it introduces new coordination challenges. In essence, the reorganization represents EF consciously narrowing its scope to focus on the hardest, most critical protocol-level problems while fostering a more distributed and sustainable ecosystem structure for Ethereum's future growth.

Foresight NewsHace 13 min(s)

Don't Just Focus on Layoffs, The New Structure of the Ethereum Foundation is More Worthy of Appreciation

Foresight NewsHace 13 min(s)

Report Analysis: What Is Coherent Planning as CPO Booms?

Title: Report Interpretation: What Moves Is Coherent Making Amid the CPO Boom? Summary: JP Morgan analyst Samik Chatterjee reiterates an Overweight rating on Coherent (COHR), citing undervalued growth potential across three core areas: data center optical transceivers, co-packaged optics (CPO) chips, and industrial lasers/thermal management. COHR's 1.6T data center transceivers are in high demand, with pricing remaining firm. The rise of CPO is seen not as a threat but as a catalyst, creating higher demand for sophisticated optical components, an area where COHR holds a competitive edge with its comprehensive portfolio (lasers, isolators, VCSELs, thermoelectric coolers). Each CPO chip offers significantly greater revenue potential than traditional transceivers. Furthermore, its Optical Circuit Switch (OCS) technology targets a potential $4B market with reliability and power advantages. The company is expanding its InP (Indium Phosphide) device capacity fourfold within two years, securing substrate supply and transitioning to more cost-effective 6-inch wafers. As one of only two major suppliers of high-quality pump lasers—currently in severe shortage—COHR can now move up the value chain from components to complete line cards/systems, boosting ASP over tenfold. Gross margin targets (>42%) may be revised upward due to high-end product premiums, cost improvements from the wafer transition, and contributions from new high-margin products like CPO and OCS. Its efficient thermadite thermal material also offers long-term growth. Industrial segment revenue grows at a steady 5-10%, supported by semiconductor equipment orders. Changes in Apple's Face ID protocol present a re-competition opportunity for 3D sensing. Overall, Coherent is positioned as a key infrastructure provider, with AI-driven compute demand fueling the need for high-speed optical interconnectivity. Growth from CPO/OCS, stable industrial performance, and margin improvement support the bullish thesis. *Disclaimer: This summary interprets a third-party analyst report from JP Morgan. It does not constitute investment advice.*

marsbitHace 35 min(s)

Report Analysis: What Is Coherent Planning as CPO Booms?

marsbitHace 35 min(s)

Dan Koe's New Article: Escape the Fate of the Wage Slave, How to Survive in the Tide of AI Replacement?

Dan Koe's article argues that the real threat in the AI era is not technology itself, but financial dependency on employers. He critiques the "wage slavery" of unfulfilling work and identifies five core skills for resilience: agency, taste, persuasion, persistence, and iteration. These are developed not by consuming content, but by starting your own venture. The key to escaping the "employee mindset" is a radical identity shift. This requires: 1) Drastically changing your environment and daily inputs, 2) Choosing a creative medium (like content or code) that provides real-world feedback through trial and error, and 3) Using that feedback to learn and adapt. Koe strongly advocates for content creation over coding for beginners, as it builds irreplaceable subjective taste and audience connection. The practical starting point is a 15-minute self-inquiry to define your "life's work." Answer: What do you know deeply? What innate abilities do you have? What childhood interests were suppressed? Then, identify your contrarian beliefs—what does the mainstream get wrong in your area? The overlap is your unique direction. The final, non-negotiable step is to publish your first piece of content tomorrow. Embrace that it will be bad; the goal is to enter the feedback loop of creation, learning, and iteration, which is the true path to independence.

marsbitHace 45 min(s)

Dan Koe's New Article: Escape the Fate of the Wage Slave, How to Survive in the Tide of AI Replacement?

marsbitHace 45 min(s)

After Laying Off 20% of Staff, What Are the Key Points of EF's New Structure?

Following the completion of a months-long organizational restructuring, the Ethereum Foundation (EF) announced a 20% workforce reduction (approximately 54 employees) on June 23rd. It reorganized its teams into five new core clusters: Protocol, Access, User, Community, and Institutional (plus Operations/Management support units). Officially, this move implements the EF's 2026 Mandate and 2025 Treasury Management Policy, aiming to create a more focused and "self-sovereign" organization. The restructuring prioritizes the CROPS principles—Censorship Resistance, Openness & Freedom, Privacy, and Security—as foundational organizational tenets. The Protocol cluster will focus on core protocol R&D, including MEV reduction and zkEVM. The Access cluster emphasizes preserving user "zero option" for non-custodial, permissionless interaction. The User, Community, and Institutional clusters will manage external engagement, with the latter handling institutional and regulatory matters. While offering enhanced severance and transition support for affected employees, the EF did not disclose budget allocations or specific KPIs for the new clusters. This has led to market uncertainty about the impact on project funding and development priorities. Analysts note the announcement's positive tone of mission focus contrasts with a backdrop of recent EF leadership changes and broader ecosystem pressures. The true impact—whether this signifies strategic realignment or reactive contraction—will become clearer as the new structure's resource allocation and project prioritization are revealed in the coming months.

marsbitHace 46 min(s)

After Laying Off 20% of Staff, What Are the Key Points of EF's New Structure?

marsbitHace 46 min(s)

Top-Tier MEV Bot Loses $7.5 Million: Is 'Approval' the Most Overlooked Fatal Risk On-Chain?

The article discusses a sophisticated attack on a prominent Ethereum MEV (Miner Extractable Value) bot, Jaredfromsubway.eth, resulting in a loss exceeding $7.5 million. Unlike typical exploits involving key leaks or smart contract bugs, this attack was a carefully orchestrated "reverse hunt." The attacker spent weeks deploying fake tokens and liquidity pools that mimicked legitimate assets like WETH and USDC. These pools were designed to appear as profitable arbitrage opportunities, tricking the automated bot's trading logic. During its normal operation, the bot was induced to grant ERC-20 token approvals to the malicious contracts. Once sufficient permissions were accumulated, the attacker drained the bot's funds by calling these pre-approved allowances. This incident highlights the often-underestimated risks associated with token approvals in Web3. The article explains that approvals are a fundamental mechanism, allowing smart contracts (like DEXs) to move a user's tokens on their behalf. However, risks arise from practices like granting infinite approvals, the persistence of approvals even after disconnecting from a dApp, and the potential for a once-trusted contract to become compromised later. The piece concludes with advice for managing approval risks: users should adopt the principle of least privilege (approving only the needed amount), use separate wallets for storage versus interactions, and regularly audit and revoke unnecessary approvals using tools like Revoke.cash. It also emphasizes the role of wallets like imToken in providing proactive defenses, such as risk warnings and clear, readable transaction signing interfaces, to help users make informed decisions. Ultimately, wallet security must extend beyond private key protection to include active management of token approvals.

marsbitHace 51 min(s)

Top-Tier MEV Bot Loses $7.5 Million: Is 'Approval' the Most Overlooked Fatal Risk On-Chain?

marsbitHace 51 min(s)

Trading

Spot
Futuros
活动图片