Major AI Collaboration Breakthrough! Stanford and NVIDIA Jointly Eliminate AI Communication Overhead, Boosting Reasoning Speed by 2.4x

marsbit · Published 2026-05-21 · Updated 2026-05-21

Introduction

A new approach called RecursiveMAS, developed by UIUC, Stanford, NVIDIA, and MIT, tackles a major bottleneck in multi-agent AI systems: the "language tax." Currently, AI agents collaborate by generating and reading natural language text, a slow, costly, and lossy process akin to relaying messages over a radio. RecursiveMAS bypasses this by letting agents communicate directly through their "thoughts" (latent-space vector representations) instead of text. Inspired by recursive language models, it treats each agent like a reusable layer in a recursive loop. A lightweight module called RecursiveLink passes these high-dimensional, semantically rich internal states between agents. Only the final agent decodes the last latent representation into human-readable text. This process, described as "telepathic" communication, cuts the overhead of encoding and decoding text at each step. The backbone model weights remain frozen and only the small RecursiveLink modules are trained, which means updating just 0.31% of total parameters and reduces training cost by over 50% compared to full fine-tuning. Evaluations across math, science, coding, and QA benchmarks show significant improvements:
- **Accuracy:** Average increase of 8.3%, with gains of up to 18.1% on complex math problems (AIME2025)...

Imagine a scenario: you have three AI assistants collaborate to solve a math problem.

The traditional approach is: the first AI "writes" out the solution idea, the second AI "reads" it and writes a new idea, and the third AI "reads" and "writes" again.

This process is like three people taking turns using walkie-talkies to relay information, each time having to "translate" thoughts in their mind into language, and the other party "translating" the language back into thoughts. Is it slow? Yes. Is it costly? Yes. Even worse, this "translation" process loses information—what you think in your mind and what you say are often not the same thing.

This is the core dilemma faced by current multi-agent AI systems: "Language Tax."

Recently, a joint team from UIUC, Stanford, NVIDIA, and MIT proposed a new approach: RecursiveMAS. It lets AI agents skip the "speaking" step and communicate directly with "thoughts." In the authors' tests, reasoning speed increased by up to 2.4x and token consumption fell by up to about 75%.

(Paper link: https://arxiv.org/abs/2604.25917)

The Dilemma of AI Meetings: Efficiency Wasted on "Talking"

Over the past two years, multi-agent systems have become one of the hottest research directions in the AI field. From OpenAI's Swarm to Microsoft's AutoGen, from LangGraph to CrewAI, various players are exploring how to make multiple AIs collaborate to solve complex tasks that a single model cannot handle alone. However, in these systems, the collaboration efficiency of multiple agents is always constrained by a fundamental assumption—agents must communicate through natural language text.

When you have a "math expert" and a "code reviewer" collaborate, the whole process seems "reasonable," but breaking it down reveals many problems:

Each information transfer involves a double conversion: internal thought → text → internal thought. The tokens consumed in this process cost not just money but also computational resources and time. More crucially, this "write out, then read in" process loses information: the rich semantics that one model compresses into text during decoding cannot be fully recovered when the next model re-encodes that text. In a workflow involving five agents, the time overhead for text encoding and decoding often accounts for over 60% of the total latency.

Even more troubling, this paradigm lacks a clear "knob" for systematic optimization. Add more agents? Marginal returns diminish, and communication overhead grows rapidly. Enlarge the context window? Token costs explode. Increase model parameters? Individual agents become stronger, but collaboration efficiency does not fundamentally improve. It is like giving each person in a group a better walkie-talkie when they still have to read text aloud one by one: the communication method has not changed, so even if everyone is smarter, overall efficiency cannot make a breakthrough. Industry remedies, whether prompt engineering or LoRA fine-tuning, only alleviate the symptoms and cannot cure this fundamental architectural problem.

RecursiveMAS: Replacing "Walkie-Talkies" with "Telepathy"

The core idea of RecursiveMAS is very clever: since language is the bottleneck, then don't use language.

It draws inspiration from the idea of Recursive Language Models. In traditional language models, data flows from the first layer to the last, linearly; the more layers, the more parameters. Recursive language models do the opposite—instead of adding layers, they repeatedly cycle the same set of layers, letting data "circulate" back and forth between layers. Each pass through this set of layers is equivalent to an additional round of "thinking," deepening the reasoning depth without increasing parameter count.
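The "reuse the same layers instead of stacking new ones" idea can be sketched in a few lines of plain Python. This is only a toy illustration: the `layer` function, the 2x2 `shared` weights, and the `tanh` nonlinearity are assumptions chosen for brevity, not details taken from the paper.

```python
import math

def layer(x, weights):
    """One toy 'layer': a weighted mix of the inputs followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

# A single shared set of weights (hypothetical values, for illustration only).
shared = [[0.5, -0.2], [0.1, 0.8]]

def recursive_forward(x, weights, rounds):
    """Apply the same layer `rounds` times: more depth, no new parameters."""
    for _ in range(rounds):
        x = layer(x, weights)
    return x

def param_count(weights):
    """Number of scalar parameters in the shared layer."""
    return sum(len(row) for row in weights)

x0 = [1.0, -1.0]
one_pass = recursive_forward(x0, shared, rounds=1)
three_passes = recursive_forward(x0, shared, rounds=3)
print(one_pass, three_passes, param_count(shared))
```

The parameter count is the same whether the input is processed once or three times; only the amount of computation changes.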

RecursiveMAS extends this idea from "within a single model" to "multi-agent systems":

Each agent acts like a layer in a recursive language model. Agents no longer generate text for one another; instead they pass "thoughts," continuous vector representations that live in latent space.

The researchers used a poetic analogy: "agents communicating telepathically as a unified whole."

Specifically, Agent A1 processes and passes its latent representation to Agent A2, A2 processes and passes to A3... until the last Agent processes, and its latent output is directly fed back to A1, starting a new round of recursive iteration. The entire process occurs entirely in latent space; only at the last Agent of the final round is the final latent representation decoded into text output. This is like a group of experts sitting around a table, not speaking, not writing notes; each person simply thinks silently and directly passes the "thought result" in their mind to the next person—the whole process is quiet and efficient.
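The control flow described above can be sketched as follows. Everything here is a stand-in: `make_agent`, `decode_to_text`, and the `tanh` transform are hypothetical helpers used to show the hand-off pattern, not the models or decoding used in the paper.

```python
import math

def make_agent(scale):
    """Return a toy agent: a fixed function from latent vector to latent vector."""
    def agent(latent):
        return [math.tanh(scale * v) for v in latent]
    return agent

def decode_to_text(latent):
    """Stand-in for text decoding; called once, at the very end."""
    return " ".join(f"{v:.3f}" for v in latent)

def run_latent_loop(agents, latent, rounds):
    """A1 -> A2 -> ... -> An, then the last output feeds back into A1."""
    for _ in range(rounds):
        for agent in agents:
            latent = agent(latent)  # the hand-off stays in latent space
    decode_calls = 1                # only the final state is decoded
    return decode_to_text(latent), decode_calls

def text_decode_calls(num_agents, rounds):
    """In a text-based chain, every agent decodes to text on every round."""
    return num_agents * rounds

agents = [make_agent(0.9), make_agent(1.1), make_agent(1.0)]
text, calls = run_latent_loop(agents, [0.5, -0.25], rounds=3)
print(text, calls, text_decode_calls(len(agents), 3))
```

With three agents and three rounds, the latent loop decodes once, while an equivalent text-based chain would decode nine times. That gap is the overhead the article calls the "language tax."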

Figure: RecursiveMAS architecture schematic—Multi-Agents achieve closed-loop recursive collaboration via embedding space (Source: arXiv)

A key component of this system is RecursiveLink, a lightweight two-layer residual module that preserves and transforms one model's last-layer latent representation and passes it into the next model's embedding space. The hidden state of a language model's last layer already encodes rich semantic reasoning information; RecursiveLink carries this high-dimensional information over intact, rather than first translating it into text and then interpreting it again. It comes in two versions: an inner link for recursive thinking within a single agent and an outer link for latent communication between agents.
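A "two-layer residual module" can be pictured roughly as below. The article does not give the exact layer sizes or activation, so the GELU nonlinearity, the `ToyRecursiveLink` name, and the tiny dimensions are all assumptions. The sketch also assumes both models share the same hidden size; connecting models of different widths would need an extra projection that is not shown.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def gelu(x):
    """Exact GELU via the error function (the activation is an assumption)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

class ToyRecursiveLink:
    """Hypothetical two-layer residual adapter: out = h + W_out * gelu(W_in * h)."""

    def __init__(self, w_in, w_out):
        self.w_in = w_in    # shape: hidden_dim x model_dim
        self.w_out = w_out  # shape: model_dim x hidden_dim

    def __call__(self, hidden_state):
        inner = [gelu(v) for v in matvec(self.w_in, hidden_state)]
        update = matvec(self.w_out, inner)
        # Residual connection: the incoming state is preserved, then adjusted.
        return [h + u for h, u in zip(hidden_state, update)]

w_in = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
identity_link = ToyRecursiveLink(w_in, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
active_link = ToyRecursiveLink(w_in, [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1]])
state = [1.0, 2.0]
print(identity_link(state), active_link(state))
```

Because of the residual path, a link whose output layer is all zeros passes the hidden state through unchanged. A freshly initialized adapter can therefore start close to "do nothing" and learn only the adjustment.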

Figure: Recursive learning process—Inner and outer links co-train (Source: arXiv)

In terms of training strategy, RecursiveMAS makes a clever design choice: the backbone model weights are completely frozen, and only the RecursiveLink modules are trained. This is similar in spirit to LoRA (Low-Rank Adaptation), but RecursiveLink is even lighter: the entire system updates only about 13 million parameters, roughly 0.31% of the total parameter count. Peak GPU memory is the lowest among all compared methods, and training cost is over 50% lower than full fine-tuning. You can think of it as a "lightweight adapter" that plugs directly into the existing agent ecosystem without training new models from scratch. If multiple agents are built on the same base model (e.g., all using Qwen), they can even share the same model weights, further saving memory.
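As a quick sanity check on the two figures quoted above, the arithmetic below derives the system size they imply. Both inputs are rounded values from the article, so the result is only an order-of-magnitude estimate, not a number reported by the authors.

```python
trainable_params = 13_000_000   # "about 13 million", as quoted in the article
trainable_fraction = 0.0031     # 0.31%, as quoted in the article

# If 13M parameters are 0.31% of the system, the whole system is 13M / 0.0031.
implied_total = trainable_params / trainable_fraction
frozen_params = implied_total - trainable_params

print(f"implied total:  {implied_total / 1e9:.2f}B parameters")
print(f"implied frozen: {frozen_params / 1e9:.2f}B parameters")
```

The implied total is a little over 4 billion parameters, almost all of it frozen backbone weights.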

Training is conducted in two stages:

Inner Loop Warm-up: Each agent independently trains its own inner RecursiveLink, learning to "think" in latent space rather than "write out" text. This stage can be parallelized, like having each person practice their "inner monologue" first.

Outer Loop Training: All agents are connected into a complete recursive chain, optimizing all RecursiveLinks jointly via shared gradients with the goal of final text output quality. This stage addresses the "credit assignment" problem—how to accurately attribute the success or failure of the final result to each Agent's contribution. This staged strategy avoids potential training instability issues from attempting everything at once.
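The outer-loop idea (frozen backbones, one loss on the final output, gradients reaching every link) can be illustrated with a deliberately tiny scalar model. This is not the paper's training procedure: the scalar "agents", the squared-error loss, the learning rate, and the names `chain_output` and `train_links` are all assumptions made so the example fits in a few lines.

```python
def chain_output(x, frozen_weights, link_params):
    """Pass x through every agent: frozen weight, then a residual-style link."""
    for w, a in zip(frozen_weights, link_params):
        h = w * x          # frozen backbone (never updated)
        x = h + a * h      # trainable link, equivalent to h * (1 + a)
    return x

def train_links(x, target, frozen_weights, steps=200, lr=0.05):
    """Gradient descent on the link parameters only, using one shared loss."""
    link_params = [0.0] * len(frozen_weights)   # links start as identity
    losses = []
    for _ in range(steps):
        y = chain_output(x, frozen_weights, link_params)
        error = y - target
        losses.append(error ** 2)
        # y is proportional to each (1 + a_i), so dy/da_i = y / (1 + a_i).
        grads = [2.0 * error * y / (1.0 + a) for a in link_params]
        link_params = [a - lr * g for a, g in zip(link_params, grads)]
    return link_params, losses

frozen = (0.5, 2.0, 0.8)
links, losses = train_links(x=1.0, target=1.2, frozen_weights=frozen)
print(links, losses[0], losses[-1])
```

The single error signal at the end of the chain updates every link, while the backbone weights in `frozen` are never touched. This mirrors, in miniature, the split between frozen models and trainable RecursiveLink modules.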

The researchers theoretically proved that the gradients of recursive training remain stable, avoiding the gradient explosion or vanishing problems common in RNNs, while also having better runtime complexity than traditional text-based MAS.

Measured Performance: "Triple Kill" in Accuracy, Speed, and Cost

No matter how good the theory sounds, it ultimately comes down to data. The research team conducted a comprehensive evaluation on 9 mainstream benchmarks covering mathematics, science & medicine, code generation, search Q&A, and 4 collaboration modes (sequential reasoning, mixture-of-experts, knowledge distillation, negotiative tool usage). The open-source models used in the experiments were quite "luxurious"—Qwen, Llama-3, Gemma3, Mistral—assigned different roles to form various collaboration modes.

The baseline lineup was equally formidable: LoRA fine-tuning, full fine-tuning (SFT), Mixture-of-Agents, TextGrad, LoopLM, and Recursive-TextMAS, which uses the same recursive loop structure but forces text communication. This last control is especially crucial—it proves that RecursiveMAS's advantages indeed come from "skipping text decoding," not from the recursive structure itself. All comparisons were conducted under the same training budget, ensuring fairness.

RecursiveMAS Core Performance Metrics

The results show that RecursiveMAS achieved consistent improvements across all metrics:

Accuracy: Average accuracy increased by 8.3%; RecursiveMAS outperformed TextGrad by 18.1% on the AIME2025 math competition and by 13% on AIME2026. Skipping text decoding did not cost information; instead, it let the model retain richer latent semantics. The information lost when thoughts are compressed into text and then decompressed is far greater than one might expect.

Speed: End-to-end inference was 1.2x to 2.4x faster, and the speedup grew as the number of recursive rounds increased. This matters for real-world applications: in AI customer service or code assistance systems that require real-time responses, a speedup of more than 2x means a qualitative leap in user experience.

Cost: Compared to Recursive-TextMAS, token consumption decreased by 34.6% to 75.6%. This is not just cost savings; it means deeper reasoning can be attempted under the same token budget.

Figure: Inference speedup multiplier at different recursive rounds

A key insight here: the greater the recursive depth, the higher the gain. The speedup effect grows with recursive rounds: average 1.2x at round 1, 1.9x at round 2, 2.4x at round 3. The reason is simple—what's saved is the time each Agent spends "writing thoughts into text"; the more Agents and rounds, the more time saved.

Figure: Token saving ratio at different recursive rounds

At the third recursive round, token consumption decreased by 75.6%—meaning that at equal performance, operating costs can be compressed to about one-quarter. For production environments requiring complex multi-step reasoning, this is undoubtedly a huge attraction.
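The "about one-quarter" figure, and what the reported speedups mean for latency, follow from simple arithmetic. The helper names below are hypothetical; the input numbers are the ones quoted in the article.

```python
def remaining_fraction(reduction_percent):
    """Fraction of the baseline still consumed after a percentage cut."""
    return 1.0 - reduction_percent / 100.0

def latency_fraction(speedup):
    """Fraction of baseline latency left after an N-times speedup."""
    return 1.0 / speedup

token_cuts = (34.6, 75.6)              # reported range of token reduction (%)
speedups = {1: 1.2, 2: 1.9, 3: 2.4}    # reported average speedup per round

for cut in token_cuts:
    print(f"{cut}% fewer tokens -> {remaining_fraction(cut):.3f} of baseline")
for rnd, factor in speedups.items():
    print(f"round {rnd}: {factor}x faster -> {latency_fraction(factor):.3f} of baseline latency")
```

A 75.6% reduction leaves 24.4% of the baseline token count, which is the "one-quarter" in the text. This covers token usage only; how it translates into total operating cost depends on pricing and hardware, which the article does not specify.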

Why is This Research Worth Attention?

If it were just numerical improvements, this paper might not have attracted such attention. What truly makes it noteworthy is its potential to redefine the Scaling direction of multi-agent systems.

Over the past few years, scaling attempts in the multi-agent field have mainly followed three paths: increasing the number of agents, expanding context windows, and stacking larger models. Each of these faces its own bottleneck: more agents lead to a communication explosion, larger windows lead to a cost explosion, and larger models lead to a training-cost explosion.

RecursiveMAS offers a new path: deepening recursive depth. It transforms "multi-agent collaboration" from a parallel, text-interaction paradigm into a deep, latent-space recursive paradigm. Just as recursive language models deepen reasoning by repeatedly processing the same problem, RecursiveMAS allows multiple agents to repeatedly "deliberate" each other's "thoughts" without having to "speak and listen back" each time.

The core question posed by the researchers in the paper is: "Can agent collaboration itself be scaled through recursion?" The answer seems to be yes.

When the system no longer needs to "translate" internal representations into human-readable intermediate formats, the upper limit of collaboration efficiency can potentially be further unlocked.

The current industry backdrop also provides practical landing scenarios for this research. Baidu's 2026 Developer Conference was themed "Agents at Scale," Anthropic is launching Claude Managed Agents, and OpenAI is advancing real-time GPT-5-level reasoning: the entire industry is looking for ways to move agent collaboration from demos to production environments. The three major hurdles (computation cost, inference latency, and memory limits) are precisely what RecursiveMAS attempts to address with a 0.31% parameter overhead.

Of course, this research is still in its early stages, and several issues deserve attention:

Data credibility needs verification. The current results are self-reported by the authors; independent teams have not yet completed replication. The academic community's attitude towards new technology is often "bold hypotheses, careful verification." In this era of "paper explosion," independent replication is the best way to test a technology's true value.

Compatibility of heterogeneous agents. Although the Outer RecursiveLink is designed to connect models of different architectures, the paper does not detail the specifics of transferring latent representations across architectures. If it can only be used for homogeneous agents, its practical application scope will be greatly limited. After all, real-world scenarios often require mixing closed-source APIs like GPT-4o and Claude.

Decreased interpretability. When agents pass not readable text but a bunch of vector representations, the entire collaboration process becomes a "black box." In production environments where AI decisions need to be accountable, this opacity may pose compliance and auditing challenges.

Complexity of production environments. The paper tests relatively clean collaboration scenarios; real production environments often involve complex factors like external tool usage, human-computer interaction, and dynamic workflows.

The proposal of RecursiveMAS essentially introduces "recursion," a Scaling strategy proven effective in the single-model era, into the multi-agent era, challenging the default assumption that "agents must pass information through natural language." If the data is reproducible, the next-stage Scaling axis in the MAS field may shift from "stacking agent count" to "deepening recursive depth."

Certainly, this research still needs validation on more independent benchmarks, requires solving the issue of heterogeneous model interconnection, and needs to prove itself in real production environments. But at least, it shows us a possibility—

Collaboration between AI agents doesn't always have to be "like chickens talking to ducks," with each side struggling to understand the other.

(This article was first published on Titanium Media APP. Author: Silicon Valley Tech_news; Editor: Jiao Yan)

Related Questions

Q: What is the core idea behind the RecursiveMAS system proposed in the research?

A: The core idea of RecursiveMAS is to eliminate the "language tax" in multi-agent AI systems. It lets AI agents communicate directly in latent space using continuous vector representations ("thoughts") rather than generating and parsing natural language text at each interaction step, thereby bypassing the inefficiencies of textual encoding and decoding.

Q: How does RecursiveMAS achieve a reported 2.4x speedup in reasoning?

A: RecursiveMAS achieves its speedup by eliminating the time-consuming process of text generation and parsing for inter-agent communication. Agents pass latent representations (vector embeddings) directly via a RecursiveLink module. The speedup scales with recursion depth (1.2x at the first round, 1.9x at the second, 2.4x at the third) because it saves the latent-to-text and text-to-latent conversion overhead for each agent in every round.

Q: What are the key performance improvements (accuracy, speed, cost) reported for RecursiveMAS?

A: The reported improvements are: 1) Accuracy: average accuracy increased by 8.3%, with gains of up to 18.1% on the AIME2025 benchmark. 2) Speed: end-to-end inference was 1.2x to 2.4x faster. 3) Cost: token consumption fell by 34.6% to 75.6% compared to text-based communication methods.

Q: What is the main purpose and design of the RecursiveLink module in RecursiveMAS?

A: RecursiveLink is a lightweight two-layer residual module designed to preserve one model's latent representations (hidden states) and transfer them into another model's embedding space. It comes in inner (for intra-agent recursive thinking) and outer (for inter-agent latent communication) versions. It allows information to flow between agents without being converted to text, and only this module needs training, keeping the base model weights frozen.

Q: What are some potential limitations or challenges mentioned for the RecursiveMAS approach?

A: Potential limitations include: 1) Results still awaiting independent verification and replication. 2) Possible compatibility issues with heterogeneous agents (different model architectures), as details on cross-architecture latent transfer are not fully disclosed. 3) Reduced interpretability, since communication happens in latent vectors and the collaborative process becomes a "black box." 4) Unproven behavior in complex real-world production environments involving tool use and dynamic workflows.
