Reshaping the Transformer Makes LLMs Smarter

marsbit | Published on 2026-06-29 | Last updated on 2026-06-29

Abstract

A new research paper proposes "Tapered Language Models (TLMs)," a method that improves large language model performance without adding any parameters. It challenges the standard Transformer design where each layer has the same number of parameters ("feed-forward network" width). Building on evidence that layers are not equally important—earlier layers handle foundational information like grammar, while later layers often reinforce existing judgments—the researchers suggest reallocating model capacity from later to earlier layers. The core idea is to make the layer width taper off monotonically from start to end, keeping total parameters and compute constant. Experiments compared linear, cosine, and sigmoid tapering curves on a 440M parameter model. The cosine curve (e.g., starting width 1.5x baseline, ending 0.5x) achieved the best result, reducing perplexity by 1.84 points compared to the uniform baseline—a significant gain at zero cost. This finding proved robust across four different model architectures (including gated attention and memory-augmented models) and at larger scales (760M and 1.3B parameters), consistently improving performance on commonsense reasoning and language modeling tasks without harming long-context retrieval ability. The work highlights a long-overlooked design dimension: optimal parameter allocation across depth. It offers a "free lever" for efficiency, potentially applicable beyond language models to vision Transformers and diffusion models. The...

In June 2026, the large-model industry is experiencing an unprecedented 'open-source tsunami': NVIDIA released a 550B-parameter hybrid-architecture model, Google released a new version of its multimodal Gemma, and Zhipu AI fully open-sourced its flagship model under the most permissive license.

Almost all vendors tell the same story: use a Mixture of Experts (MoE) structure to pack in more parameters, use sparser activation to lower costs, and use elastic network widths to match different deployment scenarios.

In other words, the entire industry is desperately researching 'how to cram more parameters into the same compute budget.'

But a new paper from researchers at Mila, Cornell University, and the University of Montreal poses a question in almost the opposite direction: What happens if we don't add a single parameter, but simply 'reposition' the parameters already existing in the model?

Paper Title: Tapered Language Models

Paper Link: https://arxiv.org/abs/2606.23670

Background: The Overlooked 'Uniform Treatment'

Since the 2017 paper 'Attention Is All You Need' introduced the Transformer, almost all language models have shared the same skeleton, whether the classic Transformer, later gated attention, recurrent memory networks, or even new architectures with 'test-time memory' capabilities: a stack of structurally identical 'layers,' each allocated exactly the same number of parameters.

This is like a chain restaurant where every location, whether downtown or suburban, is equipped with the same number of chefs and kitchen equipment, completely ignoring differences in customer flow. This 'uniform' allocation method is convenient and easy to maintain, but not necessarily optimal.

In recent years, more and more research has pointed out from different angles that model layers are not equally important.

'Early Exit' experiments show that often the model's answer is basically finalized before reaching the last layer;

'Layer Pruning' studies find that cutting out some of the later layers has almost no effect on model performance;

Interpretability research finds that shallow layers capture 'basic information' like grammar, while deep layers handle 'advanced information' like semantics.

In other words, the layers differ vastly from each other, yet parameter allocation remains uniform.

This is precisely the core question raised by the paper: Since the varying importance of layers has long been proven, why should their 'brain capacity' still be evenly distributed?

Moving 'Brain Capacity' Forward

The research team first ran a simple, coarse validation experiment: they divided the layers of a 440M-parameter Transformer into early, middle, and late groups. Keeping the total parameter count constant, they widened the feed-forward network (FFN) of one group and narrowed it in the other two. The FFN is the core component of each layer responsible for storing and processing information, and can be thought of as that layer's 'working memory capacity.'
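The bookkeeping behind this kind of grouped experiment can be sketched in a few lines of Python. The layer count, the equal group sizes, and the 1.5x boost below are illustrative assumptions, not values reported in the article, and `grouped_multipliers` is a hypothetical helper. The point is only that widening one group forces the others to shrink so that the average FFN width stays at the baseline.

```python
def grouped_multipliers(n_layers, boosted_group, boost):
    """Per-layer FFN width multipliers for a three-group split.

    Hypothetical helper: one group (0=early, 1=middle, 2=late) is scaled by
    `boost`; the other two share a common factor chosen so that the mean
    multiplier over all layers is 1.0 (same total FFN width).
    """
    if n_layers % 3 != 0:
        raise ValueError("this sketch assumes three equal-sized groups")
    group_size = n_layers // 3
    # mean = (boost + 2 * other) / 3 = 1  =>  other = (3 - boost) / 2
    other = (3.0 - boost) / 2.0
    if other <= 0:
        raise ValueError("boost too large to keep the other groups positive")
    factors = [boost if g == boosted_group else other for g in range(3)]
    return [factors[layer // group_size] for layer in range(n_layers)]


# Illustrative settings only (12 layers, 1.5x boost are assumptions)
front_heavy = grouped_multipliers(n_layers=12, boosted_group=0, boost=1.5)
back_heavy = grouped_multipliers(n_layers=12, boosted_group=2, boost=1.5)
```

Because FFN parameters scale linearly with FFN width for a fixed model dimension, keeping the mean multiplier at 1.0 keeps the total FFN parameter count unchanged.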

The result was clear: the 'front-heavy' allocation, which concentrates capacity in the early layers, lowered the model's perplexity on the validation set from 16.28 to 15.96. (Perplexity measures how accurately a language model predicts text; lower values indicate more accurate predictions.) Conversely, concentrating capacity in the late layers pushed perplexity up to 17.29.
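For readers unfamiliar with the metric, perplexity is conventionally defined as the exponential of the average negative log-likelihood per token. The snippet below is a minimal sketch of that standard definition; the function name and the toy input are illustrative and do not come from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each correct token in the evaluation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy input: a model that gives every correct token probability 0.5 is,
# on average, choosing between two equally likely options per token.
coin_flip_ppl = perplexity([math.log(0.5)] * 4)
```

Read this way, a validation perplexity of about 16 means the model is roughly as uncertain as if it were picking uniformly among 16 candidate tokens at each step.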

With the same total parameters, placement alone separated the two allocations by 1.33 perplexity points (15.96 versus 17.29), a significant gap in language model evaluation.

This finding pointed to a finer-grained question: instead of a coarse three-segment grouping, could a smoother curve gradually decrease capacity from front to back?

The researchers named this concept 'Tapered Language Models' (TLMs): select any dimension in the model that determines parameter count (e.g., the width of the feed-forward network) and make it monotonically decrease along the depth direction, while ensuring the average width of all layers still equals the original fixed value.

Thus, the total parameter count and computation remain completely unchanged, only the distribution shape changes from a 'rectangle' to a 'wedge.'
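The idea of a mean-preserving, monotonically decreasing width schedule can be sketched as follows. The exact formulas used in the paper are not given in this article, so the three curve definitions, the `steepness` parameter, and the helper name `taper_multipliers` are assumptions chosen for illustration. Each curve is symmetric about the middle of the stack, which is what keeps the average multiplier at (start + end) / 2.

```python
import math


def taper_multipliers(n_layers, start=1.5, end=0.5, curve="cosine", steepness=10.0):
    """Monotonically decreasing FFN width multipliers, first layer to last.

    Illustrative sketch, not the paper's formulas. All three curves are
    symmetric about the midpoint of the stack, so the mean multiplier is
    (start + end) / 2; start=1.5 and end=0.5 give a mean of 1.0.
    """
    if n_layers < 2:
        raise ValueError("need at least two layers to taper")

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    multipliers = []
    for layer in range(n_layers):
        t = layer / (n_layers - 1)  # depth position in [0, 1]
        if curve == "linear":
            s = 1.0 - t
        elif curve == "cosine":
            s = 0.5 * (1.0 + math.cos(math.pi * t))
        elif curve == "sigmoid":
            # Rescaled so the curve hits exactly 1 at t=0 and 0 at t=1
            lo, hi = sigmoid(-steepness / 2), sigmoid(steepness / 2)
            s = (sigmoid(steepness * (0.5 - t)) - lo) / (hi - lo)
        else:
            raise ValueError(f"unknown curve: {curve}")
        multipliers.append(end + (start - end) * s)
    return multipliers


# 24 layers is an arbitrary choice for the example
schedules = {c: taper_multipliers(24, curve=c) for c in ("linear", "cosine", "sigmoid")}
```

In a real model each multiplier would be applied to the baseline FFN width and rounded to an integer, which can shift the total slightly; a practical implementation would presumably need a small correction step to restore the exact parameter budget.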

The team tried three decreasing curves: linear decrease, cosine decrease, and S-shaped (Sigmoid) decrease.

The differences between these three curves are analogous to three different ways of 'closing up shop':

Linear decrease is like closing at a constant rate, shutting down roughly the same number of counters each period;

S-shaped decrease is like suddenly announcing closure, with most stalls remaining as is, only a small middle segment contracting rapidly;

Cosine decrease lies between the two: it transitions gently at both ends and tightens gradually through the middle. It neither 'cuts losses' abruptly nor applies uniform force everywhere and misses the region that should contract the most.

Experimental Results: Free 1.84 Points

After sweeping combinations of five width ratios and three curves on the 440M-parameter Transformer, cosine decrease emerged as the clear winner: under the best configuration (starting width 1.5 times the baseline, ending width 0.5 times), perplexity dropped from the uniform baseline of 16.28 to 14.44, an improvement of 1.84 points, without adding a single parameter or floating-point operation.

More crucially, this conclusion isn't just luck for one particular architecture.

The research team ported the exact same configuration (cosine decrease, front/back width ratio 1.5/0.5) to three other structurally distinct architectures: a gated attention model, Hope-attention with 'self-modifying memory' capability, and the Titans architecture with neural long-term memory modules. They then re-validated at two larger scales: 760M and 1.3B parameters.

The result: in all eight comparisons (four architectures at two scales), the 'tapered' models showed higher average accuracy on commonsense reasoning benchmarks and lower perplexity on the LAMBADA language prediction task.

The researchers also conducted additional long-text retrieval tests (Needle-in-a-Haystack), confirming that this redistribution does not sacrifice the model's ability to handle long contexts.

To explain the reasons behind this phenomenon, the team also measured the similarity between the output of each 'Feed-Forward Network' layer in GPT-2 series models and the existing information flow, revealing a clear pattern: The deeper into the model, the more similar the newly written content of each layer is to the existing information. In other words, later layers are more about 'reiterating' existing judgments rather than 'creating' new understandings.
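A measurement of this kind can be sketched with a simple per-layer similarity score. The article does not name the metric, so using cosine similarity here is an assumption, as are the function names and the toy vectors; real measurements would use hidden states extracted from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def write_similarity_by_layer(stream_states, ffn_outputs):
    """Compare what each layer's FFN writes with the state it writes into.

    stream_states[i] is the hidden state entering layer i's FFN and
    ffn_outputs[i] is that FFN's output (both assumed inputs).
    """
    return [cosine_similarity(h, f) for h, f in zip(stream_states, ffn_outputs)]


# Toy vectors only: the first "layer" writes something orthogonal to the
# existing state, the second writes something pointing the same way.
toy_scores = write_similarity_by_layer(
    stream_states=[[1.0, 0.0], [1.0, 1.0]],
    ffn_outputs=[[0.0, 1.0], [2.0, 2.0]],
)
```

Under this reading, a score near 0 means the layer is adding new directions, while a score near 1 means it is mostly reinforcing what is already there, which is the pattern the article describes for later layers.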

This helps explain why moving capacity from the back to the front is reasonable: the front layers can make real use of the extra 'brain capacity,' while the back layers cannot.

Conclusion

This research essentially makes a simple yet long-overlooked point: a model's capacity should not be spread uniformly across layers but should flow to where it is truly needed.

In a 2026 where the entire industry is competing over 'who has more parameters' and 'whose architecture is sparser,' this paper offers an almost zero-cost alternative: no need to change architectures, no need to add parameters, just change the 'shape' of the distribution.

The researchers also frankly state that the current optimal configuration was tuned on a 440M-parameter model. Whether there are 'special recipes' more suitable for different scales and architectures remains an open question.

But more noteworthy is that the paper points out this line of thinking is not limited to language models—Vision Transformers, diffusion models, and multimodal models almost all inherit the same default setting of 'equal distribution per layer.' If the shape of capacity distribution itself is a long-overlooked design dimension, then this 'free lever hidden in plain sight' may have only just been noticed.

Team Introduction

The paper was completed jointly by Reza Bayat from Mila (Montreal Institute for Learning Algorithms), Ali Behrouz from Cornell University, and Aaron Courville, co-founder of Mila and professor at the University of Montreal.

Ali Behrouz is currently a researcher at Google Research and a PhD student at Cornell University. Over the past two years, he has participated in designing several new architectures that have garnered widespread attention, including the Titans architecture capable of 'learning and remembering during test time,' as well as the subsequent Atlas and 'Nested Learning' framework. He has long focused on how to make models utilize and store long-term context information more efficiently.

Aaron Courville is a senior scholar in deep learning and holds a CIFAR AI Chair. He has long collaborated with Yoshua Bengio on fundamental deep learning research, with deep expertise in representation learning and generative models. He is also a co-author of the original Generative Adversarial Networks (GANs) paper and co-authored the classic book 'Deep Learning' with Ian Goodfellow and Yoshua Bengio.

This article is from WeChat Official Account 'Jiqizhixin' (ID: almosthuman2014), author: Following AI

Related Questions

Q: What is the core concept of 'Tapered Language Models (TLMs)' proposed in the research?

A: Tapered Language Models (TLMs) propose a non-uniform parameter allocation strategy for Transformer layers. Instead of giving each layer an equal number of parameters, TLMs allocate more capacity (e.g., a wider feed-forward network) to the earlier layers and gradually decrease it towards the later layers, forming a wedge-shaped or tapered distribution, while keeping the total number of parameters and computational cost unchanged.

Q: According to the article, what was the key experimental finding that demonstrated the effectiveness of moving capacity to earlier layers?

A: In a key experiment, researchers redistributed parameters in a 440M-parameter Transformer by widening the feed-forward network (FFN) of the early layers and narrowing the later ones, while keeping the total constant. This 'front-heavy' configuration lowered the model's perplexity from 16.28 to 15.96. Conversely, concentrating capacity in the later layers worsened perplexity to 17.29, showing that earlier layers benefit more from extra parameters.

Q: Which decreasing curve (linear, cosine, or sigmoid) performed best in the TLM experiments and what was the resulting performance improvement?

A: The cosine decreasing curve performed best in the TLM experiments. In the optimal configuration for a 440M-parameter model (starting width 1.5x the baseline, ending width 0.5x), it reduced perplexity from the uniform baseline of 16.28 to 14.44, an improvement of 1.84 points without adding any parameters or computation.

Q: What underlying reason does the research suggest for why later layers need fewer parameters?

A: The research suggests that later layers in a Transformer tend to produce outputs that are more similar to the existing information flow, meaning they are often 'reiterating' or 'emphasizing' earlier judgments rather than 'creating' fundamentally new understandings. Since they perform less novel computation, they require less parameter capacity than the earlier layers, which handle more foundational processing.

Q: Who are the main authors of this research paper on Tapered Language Models?

A: The main authors are Reza Bayat from Mila, Ali Behrouz from Cornell University (and Google Research), and Aaron Courville, a professor at the University of Montreal and a co-founder of Mila. Aaron Courville is also a co-author of the seminal book 'Deep Learning' and of the original Generative Adversarial Networks (GANs) paper.
