Tsinghua '00s Alumnus Wang Guan's New Work: Disrupting Transformer Pretraining Models with 1/900 Tokens, 1/432 Compute Power

marsbit. Published 2026-05-26. Updated 2026-05-26.

Introduction

Tsinghua alumnus Wang Guan's team proposes HRM-Text, a novel pre-training paradigm that replaces the standard Transformer with a Hierarchical Recurrent Model. With just 1B parameters and 40B unique tokens, trained at a cost of about $1,500, HRM-Text achieves performance comparable to 2B-7B open-source models while using up to 900x fewer tokens and 432x less estimated compute. Key innovations include a dual-timescale recurrent architecture for greater effective depth, a task-completion objective that trains only on answer tokens with PrefixLM masking, and two stability techniques, MagicNorm and Warmup Deep Credit Assignment. Evaluations show strong results on benchmarks such as MMLU (60.7%) and GSM8K (84.5%). The work highlights how architectural priors and targeted objectives can lower pre-training barriers, though limitations include knowledge-reasoning coupling, a fixed amount of compute per token, and scaling behavior that has not yet been verified beyond the 1B HRM-Text and 3B Transformer experiments.

Breaking the traditional paradigm of large model pretraining, Tsinghua '00s alumnus Wang Guan's team has released a new work:

They replaced the standard Transformer with a Hierarchical Recurrent Model (HRM) and proposed HRM-Text, an efficient pretraining method that looks beyond scaling alone.

Paper link: https://arxiv.org/abs/2605.20613

While using approximately 100-900x fewer training tokens and 96-432x less estimated compute than standard baseline models, HRM-Text still achieved performance comparable to open-source models with 2B to 7B parameters.

Furthermore, using 1B parameters, 40B non-repeating tokens, and a training cost of about $1500, HRM-Text achieved the following scores on mainstream benchmarks: MMLU 60.7%, ARC-C 81.9%, DROP 82.2%, GSM8K 84.5%, MATH 56.2%.

Figure|Pretraining efficiency.

Based on these results, the team argues that structural priors and targeted training objectives can significantly lower the barrier to pretraining, making it feasible to train a foundation model from scratch.

How is HRM-Text designed?

Large Language Model (LLM) pretraining is increasingly reliant on a few institutions with ample computing and data resources. Training a competitive foundational model often requires trillions of tokens, thousands of GPUs, and even tens of millions of dollars in compute investment.

However, the current training paradigm is inefficient. A large amount of computation is consumed on irrelevant tokens such as prompts, formatting padding, and web noise, resulting in a significant portion of training compute not directly serving inference.

In this work, the research team redesigned the architecture and training objective, making the pretraining of HRM-Text relatively more efficient.

Architecture: Employs a Hierarchical Recurrent Model with dual timescales, splitting computation into a slow H module and a fast L module. While a standard Transformer performs one forward pass per token, HRM performs multiple rounds of recursive updates on the same token. The H and L modules each account for only half of the recursive core parameters. The overall computational load is roughly equivalent to performing 4 recursive unrolls on the same set of parameters, increasing computational depth without adding more parameters.
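As a rough illustration of the dual-timescale loop described above, the following Python sketch runs a fast L update several times for every slow H update and counts how often each module is called. The function names, the toy scalar updates, and the schedule (2 slow cycles of 3 fast steps each) are assumptions made for this example, not values taken from the paper; they are chosen so that eight calls to half-size modules add up to the roughly 4 core-equivalent passes mentioned above.

```python
from typing import Callable, Tuple


def hrm_unroll(
    x: float,
    n_cycles: int,
    l_steps: int,
    l_update: Callable[[float, float, float], float],
    h_update: Callable[[float, float], float],
) -> Tuple[float, int, int]:
    """Run a two-timescale recurrence on one input and count module calls."""
    z_h, z_l = 0.0, 0.0  # slow (H) and fast (L) states
    l_calls = h_calls = 0
    for _ in range(n_cycles):
        # The fast module iterates several times under a fixed slow state.
        for _ in range(l_steps):
            z_l = l_update(z_l, z_h, x)
            l_calls += 1
        # The slow module updates once per cycle from the fast state.
        z_h = h_update(z_h, z_l)
        h_calls += 1
    return z_h, l_calls, h_calls


def core_equivalents(l_calls: int, h_calls: int, share: float = 0.5) -> float:
    """Each module holds `share` of the core parameters (half, per the article)."""
    return (l_calls + h_calls) * share


# Toy scalar updates standing in for the real modules (assumption).
toy_l = lambda z_l, z_h, x: 0.5 * z_l + 0.25 * z_h + x
toy_h = lambda z_h, z_l: 0.5 * z_h + 0.5 * z_l

final_state, n_l, n_h = hrm_unroll(1.0, n_cycles=2, l_steps=3,
                                   l_update=toy_l, h_update=toy_h)
passes = core_equivalents(n_l, n_h)
```

Under these assumed settings the loop makes 6 L calls and 2 H calls, which is 4 core-equivalent passes over the same parameters.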

Training Objective: Abandons the standard full-text autoregressive pretraining. Instead, training is performed directly on instruction-answer pairs, with loss calculated only on the answer part, combined with PrefixLM masking, allowing bidirectional attention on the instruction part and causal masked generation for the answer part.
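The masking and loss rules above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the helper names are hypothetical, per-token losses are passed in as ordinary floats, and the sketch ignores the one-position shift that real next-token training applies between inputs and targets.

```python
from typing import List, Sequence


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Every position sees the whole instruction prefix (bidirectional there);
    answer positions additionally see earlier answer positions (causal).
    """
    return [
        [(j < prefix_len) or (j <= i) for j in range(total_len)]
        for i in range(total_len)
    ]


def answer_loss_mask(prefix_len: int, total_len: int) -> List[bool]:
    """Only answer positions contribute to the loss."""
    return [pos >= prefix_len for pos in range(total_len)]


def masked_mean(per_token_loss: Sequence[float], mask: Sequence[bool]) -> float:
    """Average the loss over the positions selected by the mask."""
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    return sum(kept) / len(kept)


# Two instruction tokens followed by two answer tokens.
attn = prefix_lm_mask(prefix_len=2, total_len=4)
loss_mask = answer_loss_mask(prefix_len=2, total_len=4)
loss = masked_mean([1.0, 2.0, 3.0, 5.0], loss_mask)
```

In this toy case the first instruction token can attend to the second one, no instruction token can see the answer, and the loss averages only the last two positions.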

Figure|HRM-Text architecture.

To enhance the stability of recursive training, the research team introduced MagicNorm and Warmup Deep Credit Assignment.

MagicNorm is a hybrid normalization strategy that leverages the asymmetry between forward and backward computation depths under Truncated Backpropagation Through Time (Truncated BPTT). It uses PreNorm internally within modules and adds an extra normalization layer at the module exit, thereby improving the stability of deep recursive training.
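The structure described above, PreNorm inside the module plus one extra normalization where the module hands its state back to the recurrence, can be sketched as follows. This is an illustrative sketch only: the use of RMS normalization without learned scale, the function names, and the toy sublayer are assumptions, and the article does not specify which normalization the paper uses.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def rms(v: Sequence[float]) -> float:
    return math.sqrt(sum(a * a for a in v) / len(v))


def rms_norm(v: Sequence[float], eps: float = 1e-6) -> Vector:
    scale = math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a / scale for a in v]


def prenorm_block(v: Vector, sublayer: Sublayer) -> Vector:
    """PreNorm: normalize the sublayer input, leave the residual path as-is."""
    update = sublayer(rms_norm(v))
    return [a + b for a, b in zip(v, update)]


def recurrent_module(v: Vector, sublayers: Sequence[Sublayer],
                     exit_norm: bool = True) -> Vector:
    """PreNorm blocks inside, optional extra normalization at the exit."""
    for sublayer in sublayers:
        v = prenorm_block(v, sublayer)
    return rms_norm(v) if exit_norm else v


def unroll(v: Vector, sublayers: Sequence[Sublayer],
           n_steps: int, exit_norm: bool) -> Vector:
    """Apply the same module repeatedly, as a recurrence would."""
    for _ in range(n_steps):
        v = recurrent_module(v, sublayers, exit_norm)
    return v


double: Sublayer = lambda u: [2.0 * a for a in u]  # toy sublayer (assumption)
with_exit = unroll([3.0, 4.0], [double], n_steps=8, exit_norm=True)
without_exit = unroll([3.0, 4.0], [double], n_steps=8, exit_norm=False)
```

With the exit normalization the state magnitude stays near 1 however many times the module is reapplied, whereas the pure PreNorm residual keeps growing with each recursion in this toy setting.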

Warmup Deep Credit Assignment, on the other hand, only backpropagates gradients from the last 2 recursive steps during the initial training phase, then linearly extends to the last 5 steps. This training mechanism allows the model to converge stably on shorter credit assignment paths before gradually introducing longer dependencies.
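A schedule of this kind might look like the sketch below. The endpoints (2 and 5 steps) come from the description above; the hold and ramp lengths, the rounding down to whole steps, and the function names are hypothetical choices for illustration, since the article does not give them.

```python
from typing import List


def bptt_window(step: int, hold_steps: int, ramp_steps: int,
                start: int = 2, end: int = 5) -> int:
    """Number of final recursion steps that receive gradients at `step`.

    Stays at `start` during the hold phase, then grows linearly to `end`
    over `ramp_steps` training steps (rounded down to a whole step).
    """
    if step < hold_steps:
        return start
    progress = step - hold_steps
    if ramp_steps <= 0 or progress >= ramp_steps:
        return end
    return start + int((end - start) * progress / ramp_steps)


def steps_with_grad(total_recursions: int, window: int) -> List[int]:
    """Indices of the recursion steps kept in the backward pass."""
    first = max(0, total_recursions - window)
    return list(range(first, total_recursions))


# Hypothetical schedule: hold for 100 training steps, ramp over the next 300.
early = bptt_window(0, hold_steps=100, ramp_steps=300)
middle = bptt_window(250, hold_steps=100, ramp_steps=300)
late = bptt_window(1000, hold_steps=100, ramp_steps=300)
kept = steps_with_grad(total_recursions=8, window=early)
```

Earlier recursion steps still run in the forward pass; they are simply excluded from backpropagation, which is what keeps the credit assignment path short at the start of training.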

How effective is it?

Experimental results show that HRM-Text demonstrates clear advantages in architecture efficiency, training objectives, and overall performance.

1. Under fixed training compute, is the recurrent architecture more effective?

Results show that under FLOPs-aligned conditions, HRM 1B outperforms Transformer 1B, Transformer 3B, Looped Transformer 1B, and RINS 1B on most benchmarks; comparison with TRM also indicates that HRM training is more stable.

Figure|Comparison of performance and stability with Transformer models. HRM maintained stable training dynamics across all scales, while Transformer models exhibited severe instability at the 1-billion-parameter scale. Furthermore, at the 0.6B scale, HRM achieved competitive performance on most benchmarks while using 2x less computation than Transformer models.

2. Do the task completion objective and PrefixLM help?

Ablation studies show that under FLOPs-aligned conditions, the MMLU score for a 1B Transformer increased from 40.55 with standard autoregressive training, to 47.72 after introducing the task completion objective, to 53.15 after adding PrefixLM, and finally to 60.73 after switching to the HRM architecture.
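To make the size of each step explicit, the short Python calculation below computes the gain contributed by each stage of the ablation, using only the four MMLU scores quoted above. The stage labels are shorthand introduced for this example.

```python
# MMLU scores reported for each cumulative stage of the ablation.
stages = [
    ("standard autoregressive", 40.55),
    ("+ task completion objective", 47.72),
    ("+ PrefixLM", 53.15),
    ("+ HRM architecture", 60.73),
]

# Gain of each stage over the previous one, in MMLU points.
gains = [
    (name, round(score - stages[i][1], 2))
    for i, (name, score) in enumerate(stages[1:])
]

total_gain = round(stages[-1][1] - stages[0][1], 2)

for name, gain in gains:
    print(f"{name}: +{gain:.2f}")
print(f"total: +{total_gain:.2f}")
```

The three steps add 7.17, 5.43, and 7.58 points respectively, for a total gain of 20.18 points, so each change contributes a gain of the same order rather than one change dominating.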

Figure|Performance comparison between different model architectures and training objectives

3. How does HRM-Text's efficiency compare to contemporary open models?

HRM-Text 1B achieved scores of 60.7, 81.9, 82.2, 84.5, and 56.2 on MMLU, ARC-C, DROP, GSM8K, and MATH respectively. Compared to open models that generally have larger training budgets, it entered the performance range of 2B to 7B open-source models using only 40B unique tokens and 1B parameters; training required up to 900x fewer tokens and up to 432x less compute.

Figure|Evaluation results of HRM-Text 1B compared with contemporary fully open-source models and open-weight models

4. Does the recurrent structure bring greater effective depth?

Results show that the standard Transformer and Looped Transformer stabilize at shallower depths, while HRM maintains more pronounced inter-block representation changes, lower cosine similarity, and higher logit lens KL values even at deeper layers.

Figure|Effective depth analysis.

Figure|Layer-wise Logit Lens KL analysis.

Limitations and Future Directions

Although HRM-Text demonstrates strong performance on reasoning-intensive tasks, the method still has limitations, and the team outlines several directions for future research.

1. Towards Decoupling "Knowledge" and "Reasoning"

Currently, broader factual knowledge coverage still depends more on model scale and data breadth. HRM-Text was only trained on 40B unique tokens, and explicit knowledge sources constitute only part of the task-formatted mixed data. In the future, researchers need to design a compact reasoning core separately from external fact storage, delegating knowledge breadth to curated corpora, retrieval-augmented modules, or learnable memory.

2. Adaptive Computation Time

The recurrent scheduling of HRM-Text brings greater effective serial depth, but this also means the model must execute a fixed number of recursive steps during inference. A promising future direction is to introduce an adaptive computation time mechanism, allowing simple samples to stop computation earlier and reserving the full recursive budget for difficult samples, thereby reducing inference cost.
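One simple way to picture such a mechanism is an early-exit loop that stops once the recurrent state has stopped changing. The sketch below is purely illustrative: the halting rule (a threshold on the change in state), the scalar state, and all names are assumptions, and the article does not describe how the authors would implement adaptive computation.

```python
from typing import Callable, Tuple


def run_with_early_exit(
    state: float,
    step_fn: Callable[[float], float],
    max_steps: int,
    tol: float,
) -> Tuple[float, int]:
    """Iterate `step_fn` until the state moves by less than `tol`.

    Returns the final state and the number of recursion steps actually used,
    never exceeding the full budget `max_steps`.
    """
    used = 0
    for used in range(1, max_steps + 1):
        new_state = step_fn(state)
        delta = abs(new_state - state)
        state = new_state
        if delta < tol:
            break
    return state, used


# Toy recurrence with a fixed point at 2.0 (assumption for illustration).
toy_step = lambda s: 0.5 * s + 1.0

easy_state, easy_steps = run_with_early_exit(2.0, toy_step, max_steps=8, tol=0.1)
hard_state, hard_steps = run_with_early_exit(0.0, toy_step, max_steps=8, tol=0.1)
```

An input that already sits at the fixed point exits after a single step, while one that starts further away uses more of the budget, which is the cost profile the paragraph above describes.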

3. Current Scaling Validation Scope Remains Limited

The current scaling experiments only cover up to the 3B parameter Transformer control group and the 1B parameter HRM-Text. The research team states that it remains to be verified by subsequent work whether similar efficiency advantages can be maintained at larger model scales.

4. PrefixLM and Inference Frameworks

Currently, PrefixLM still faces certain engineering implementation constraints in practical deployment. Although it can run on standard text generation inference frameworks like vLLM, this requires the framework to support custom attention masks during the prefill stage. Extending it to multi-turn dialogue scenarios further requires designing a KV-cache mechanism that ensures bidirectional visibility within user segments while maintaining causal constraints for the assistant's generation process.
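The visibility rule for the multi-turn case can be written down directly, even though the cache design itself remains open. The following sketch builds the attention mask for a conversation given as (role, length) pairs; the role labels and function names are assumptions for illustration, and the sketch says nothing about how any particular inference framework exposes custom masks.

```python
from typing import List, Sequence, Tuple

Turn = Tuple[str, int]  # (role, number of tokens)


def expand_turns(turns: Sequence[Turn]) -> Tuple[List[str], List[int]]:
    """Per-token role labels and segment indices for a conversation."""
    roles: List[str] = []
    segments: List[int] = []
    for index, (role, length) in enumerate(turns):
        roles.extend([role] * length)
        segments.extend([index] * length)
    return roles, segments


def multi_turn_mask(turns: Sequence[Turn]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Every token sees all earlier tokens (causal). A user token additionally
    sees later tokens within its own user segment (bidirectional).
    """
    roles, segments = expand_turns(turns)
    n = len(roles)
    return [
        [
            j <= i or (roles[i] == "user" and segments[i] == segments[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


conversation = [("user", 2), ("assistant", 2), ("user", 2)]
mask = multi_turn_mask(conversation)
```

Because user tokens can look ahead within their segment, their cached keys and values can only be finalized once the whole user segment has been received, which is one reason the multi-turn cache needs dedicated design.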

For more technical details, please refer to the original paper.

This article comes from the WeChat public account "Academic Headlines" (ID: SciTouTiao), author: Xia Qiansi

Related questions

Q: What is HRM-Text and how does it differ from the standard Transformer architecture for pre-training large language models?

A: HRM-Text is an efficient pre-training method proposed by the team of Tsinghua alumnus Wang Guan. It uses a Hierarchical Recurrent Model (HRM) instead of the standard Transformer. The key difference is that HRM employs a two-timescale hierarchical recurrence, where each token undergoes multiple recursive updates (via slow 'H' and fast 'L' modules), increasing computational depth without adding parameters. This contrasts with the Transformer's single forward pass per token.

Q: According to the article, what are the key efficiency claims of HRM-Text in terms of training tokens and computational cost?

A: The article claims HRM-Text achieves performance comparable to 2B to 7B parameter open-source models while using approximately 100-900 times fewer training tokens and 96-432 times less estimated computational power compared to standard baseline models. A specific example is a 1B parameter model trained on 40B unique tokens at a cost of around $1,500.

Q: What are the two main design choices in HRM-Text's training objective that contribute to its efficiency?

A: The two main design choices in the training objective are: 1) Training directly on instruction-answer pairs and computing the loss only on the answer part, rather than using standard full-sequence autoregressive pre-training. 2) Employing PrefixLM masking, which allows bidirectional attention on the instruction (prefix) part and causal masking for generating the answer.

Q: What techniques did the researchers introduce to improve the stability of deep recurrent training in HRM-Text?

A: To improve stability for deep recurrent training, the researchers introduced two techniques: 1) MagicNorm, a hybrid normalization strategy using PreNorm inside modules and an extra normalization at the module output, leveraging asymmetry in forward/backward depths under Truncated BPTT. 2) Warmup Deep Credit Assignment, which initially backpropagates gradients only from the last 2 recursion steps and linearly extends to the last 5 steps during training.

Q: What are some of the limitations and future research directions mentioned for HRM-Text?

A: The mentioned limitations and future directions include: 1) Decoupling 'knowledge' and 'reasoning', suggesting a need to combine the compact reasoning core with external factual storage (e.g., curated corpora, retrieval-augmented modules). 2) Exploring Adaptive Computation Time to reduce inference cost for easier samples. 3) Validating the efficiency advantage at larger model scales beyond the current 3B/1B experiments. 4) Addressing engineering challenges for deploying PrefixLM in multi-turn dialogue, such as designing a suitable KV-cache mechanism.
