Major AI Collaboration Breakthrough! Stanford and NVIDIA Jointly Eliminate AI Communication Overhead, Boosting Reasoning Speed by 2.4x

marsbit · Published 2026-05-21 · Updated 2026-05-21

Introduction

A new approach called RecursiveMAS, developed by UIUC, Stanford, NVIDIA, and MIT, tackles the major bottleneck in multi-agent AI systems: the "language tax." Currently, AI agents collaborate by generating and reading natural language text, a slow, costly, and lossy process akin to inefficient radio communication. RecursiveMAS bypasses this by letting agents communicate directly through their "thoughts" (latent-space vector representations) instead of text. Inspired by recursive language models, it treats each agent like a reusable layer in a recursive loop. A lightweight module called RecursiveLink passes these high-dimensional, semantically rich internal states between agents. Only the final agent decodes the last latent representation into human-readable text. This process, described as "telepathic" communication, dramatically cuts the overhead of encoding and decoding text at each step.

The system is also cheap to train: the core model weights remain frozen and only the small RecursiveLink modules are trained, which requires updating just 0.31% of total parameters and reduces training costs by over 50% compared to full fine-tuning.

Comprehensive evaluations across math, science, coding, and QA benchmarks show significant improvements:
- Accuracy: average increase of 8.3%, with gains up to 18.1% on complex math problems (AIME2025)...

Imagine a scenario: you have three AI assistants collaborate to solve a math problem.

The traditional approach is: the first AI "writes" out the solution idea, the second AI "reads" it and writes a new idea, and the third AI "reads" and "writes" again.

This process is like three people taking turns using walkie-talkies to relay information, each time having to "translate" thoughts in their mind into language, and the other party "translating" the language back into thoughts. Is it slow? Yes. Is it costly? Yes. Even worse, this "translation" process loses information—what you think in your mind and what you say are often not the same thing.

This is the core dilemma faced by current multi-agent AI systems: the "language tax."

Recently, a joint team from UIUC, Stanford, NVIDIA, and MIT proposed a new approach—RecursiveMAS. It allows AIs to skip the "speaking" step and communicate directly with "thoughts." In tests, reasoning speed increased by 2.4x, and token consumption was reduced by 75%.

(Paper link: https://arxiv.org/abs/2604.25917)

The Dilemma of AI Meetings: Efficiency Wasted on "Talking"

Over the past two years, multi-agent systems have become one of the hottest research directions in the AI field. From OpenAI's Swarm to Microsoft's AutoGen, from LangGraph to CrewAI, various players are exploring how to make multiple AIs collaborate to solve complex tasks that a single model cannot handle alone. However, in these systems, the collaboration efficiency of multiple agents is always constrained by a fundamental assumption—agents must communicate through natural language text.

When you have a "math expert" and a "code reviewer" collaborate, the whole process seems "reasonable," but breaking it down reveals many problems:

Each information transfer involves a double conversion: internal thought → text → internal thought. The tokens consumed in this process cost not just money but also precious compute and time. More crucially, this "write out, then read in" process loses information: the rich semantics that one model compresses into text during decoding cannot be fully recovered when the next model re-encodes that text. In a workflow involving five agents, the time overhead of text encoding and decoding often accounts for over 60% of total latency.

Even more troubling is that this paradigm lacks a clear "knob" for systematic optimization—add more agents? Marginal returns diminish, and communication overhead increases exponentially. Increase context window? Token costs explode. Increase model parameters? Individual agents become stronger, but collaboration efficiency doesn't improve fundamentally—it's like giving a group of people better walkie-talkies each, but they still have to read text aloud one by one; the communication method hasn't changed, so even if everyone is smarter, overall efficiency cannot have a breakthrough. Industry solutions, whether prompt engineering or LoRA fine-tuning, can only alleviate symptoms to some extent, unable to cure this fundamental architectural problem.

RecursiveMAS: Replacing "Walkie-Talkies" with "Telepathy"

The core idea of RecursiveMAS is very clever: since language is the bottleneck, then don't use language.

It draws inspiration from the idea of Recursive Language Models. In traditional language models, data flows from the first layer to the last, linearly; the more layers, the more parameters. Recursive language models do the opposite—instead of adding layers, they repeatedly cycle the same set of layers, letting data "circulate" back and forth between layers. Each pass through this set of layers is equivalent to an additional round of "thinking," deepening the reasoning depth without increasing parameter count.
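The looping idea can be sketched in a few lines. This is a toy illustration only, not the architecture from the paper: `make_block`, its two scalar parameters, and the `tanh` update are hypothetical stand-ins for a real stack of transformer layers. What it shows is that applying one shared block several times adds depth while the parameter count stays fixed.

```python
import math
from typing import Callable, List

Vector = List[float]

def make_block(scale: float, bias: float) -> Callable[[Vector], Vector]:
    """Build a tiny 'layer block' with two parameters (hypothetical toy, not a real transformer)."""
    def block(h: Vector) -> Vector:
        # Residual update applied element-wise: h + tanh(scale * h + bias).
        return [x + math.tanh(scale * x + bias) for x in h]
    return block

def recursive_forward(block: Callable[[Vector], Vector], h: Vector, num_loops: int) -> Vector:
    """Apply the same block num_loops times; no new parameters are created per loop."""
    for _ in range(num_loops):
        h = block(h)
    return h

shared_block = make_block(scale=0.5, bias=0.1)  # two parameters, regardless of loop count
h0 = [0.2, -0.4, 1.0]
shallow = recursive_forward(shared_block, h0, num_loops=1)  # one round of "thinking"
deep = recursive_forward(shared_block, h0, num_loops=4)     # four rounds, same weights
```

A conventional model would need four distinct blocks to reach the depth of `deep`; here the depth comes from reuse.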

RecursiveMAS extends this idea from "within a single model" to "multi-agent systems":

Each agent is like a layer in a recursive language model; they no longer generate text but pass "thoughts": continuous vector representations in latent space.

The researchers used a poetic analogy: "agents communicating telepathically as a unified whole."

Specifically, Agent A1 processes and passes its latent representation to Agent A2, A2 processes and passes to A3... until the last Agent processes, and its latent output is directly fed back to A1, starting a new round of recursive iteration. The entire process occurs entirely in latent space; only at the last Agent of the final round is the final latent representation decoded into text output. This is like a group of experts sitting around a table, not speaking, not writing notes; each person simply thinks silently and directly passes the "thought result" in their mind to the next person—the whole process is quiet and efficient.

Figure: RecursiveMAS architecture schematic—Multi-Agents achieve closed-loop recursive collaboration via embedding space (Source: arXiv)
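The closed loop described above can be sketched as follows. Everything here is a hypothetical stand-in: `make_agent` replaces a frozen language model with a simple affine map, and `decode_to_text` replaces a real decoder. The sketch is only meant to show the control flow, in which latents pass from agent to agent across rounds and text is produced exactly once, at the very end.

```python
from typing import Callable, List

Vector = List[float]
Agent = Callable[[Vector], Vector]

def make_agent(shift: float) -> Agent:
    """Hypothetical stand-in for a frozen LLM agent: latent vector in, latent vector out."""
    def agent(h: Vector) -> Vector:
        return [0.5 * x + shift for x in h]
    return agent

def decode_to_text(h: Vector) -> str:
    """Hypothetical decoder: the only place where a latent becomes text."""
    return "answer:" + ",".join(f"{x:.3f}" for x in h)

def run_recursive_mas(agents: List[Agent], h: Vector, num_rounds: int) -> str:
    for _ in range(num_rounds):   # one round is a full pass A1 -> A2 -> ... -> An
        for agent in agents:
            h = agent(h)          # no text is generated between agents
        # the last agent's latent output becomes the first agent's input next round
    return decode_to_text(h)      # decode once, after the final round

agents = [make_agent(0.1), make_agent(-0.2), make_agent(0.3)]
print(run_recursive_mas(agents, [1.0, 2.0], num_rounds=3))
```

A text-based system would instead call something like `decode_to_text` after every agent and re-encode the result before the next one, which is the overhead the article calls the language tax.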

A key component of this system is RecursiveLink, a lightweight two-layer residual module that preserves and transforms one model's last-layer hidden state and passes it into the next model's embedding space. The hidden state of a language model's last layer already encodes rich semantic reasoning information; RecursiveLink carries this high-dimensional information across intact, rather than first translating it into text and then re-interpreting it. It comes in two versions: an inner link, for an agent's own recursive thinking, and an outer link, for latent communication between agents.
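A minimal sketch of what a two-layer residual module might look like, under stated assumptions: the article gives only "two-layer" and "residual", so the `tanh` activation, the absence of biases and normalization, and the equal input and output dimensions are all guesses made for illustration. The name `recursive_link` is hypothetical and not taken from the paper's code.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def recursive_link(h: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Two-layer residual map: h + W2 * tanh(W1 * h). The activation is an assumption."""
    hidden = [math.tanh(v) for v in matvec(w1, h)]   # first layer
    update = matvec(w2, hidden)                      # second layer
    return [hi + ui for hi, ui in zip(h, update)]    # residual path keeps the input state

h = [0.5, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2_zero = [[0.0, 0.0], [0.0, 0.0]]
w2_small = [[0.1, 0.0], [0.0, 0.1]]

unchanged = recursive_link(h, w1, w2_zero)   # with a zero second layer the link is the identity
adjusted = recursive_link(h, w1, w2_small)   # a small learned update on top of the input
```

The residual path is what makes "preserving" plausible: the module only has to learn a correction to the hidden state, not a full re-encoding. Connecting models with different hidden sizes would need an extra projection on the skip path, which this sketch leaves out.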

Figure: Recursive learning process—Inner and outer links co-train (Source: arXiv)

In terms of training strategy, RecursiveMAS has a clever design: the backbone model weights are completely frozen, and only the RecursiveLink modules are trained. This is similar in spirit to LoRA (Low-Rank Adaptation), but RecursiveLink is even lighter: the entire system only needs to update about 13 million parameters, or 0.31% of total parameters. Peak GPU memory requirement is the lowest among all compared methods, and training cost is reduced by over 50% compared to full fine-tuning. You can think of it as a "lightweight adapter" that plugs directly into the existing agent ecosystem without needing to train new models from scratch. If multiple agents are based on the same base model (e.g., all using Qwen), they can even share the same model weights, further saving memory.
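As a quick sanity check on those two figures, the snippet below derives the total parameter count they jointly imply. Both inputs are rounded values quoted in the article, and the reading of 0.31% as a share of all parameters (frozen plus trained) is an assumption, so the result is a rough estimate and not a number reported by the authors.

```python
TRAINABLE_PARAMS = 13_000_000   # "about 13 million", as quoted in the article
TRAINABLE_FRACTION = 0.0031     # 0.31%, as quoted in the article

# If 13M parameters are 0.31% of the system, the whole system is 13M / 0.0031.
implied_total = TRAINABLE_PARAMS / TRAINABLE_FRACTION
implied_frozen = implied_total - TRAINABLE_PARAMS

print(f"implied total parameters:  {implied_total / 1e9:.2f}B")
print(f"implied frozen parameters: {implied_frozen / 1e9:.2f}B")
```

The two quoted numbers are consistent with a system of roughly 4.2 billion parameters in total, almost all of it frozen backbone weights.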

Training is conducted in two stages:

Inner Loop Warm-up: Each agent independently trains its own Inner RecursiveLink, teaching it to "think" in latent space rather than "write" text. This stage can be parallelized, like having each person practice their "inner monologue" first.

Outer Loop Training: All agents are connected into a complete recursive chain, optimizing all RecursiveLinks jointly via shared gradients with the goal of final text output quality. This stage addresses the "credit assignment" problem—how to accurately attribute the success or failure of the final result to each Agent's contribution. This staged strategy avoids potential training instability issues from attempting everything at once.

The researchers theoretically proved that the gradients of recursive training remain stable, avoiding the gradient explosion or vanishing problems common in RNNs, while also having better runtime complexity than traditional text-based MAS.

Measured Performance: "Triple Kill" in Accuracy, Speed, and Cost

No matter how good the theory sounds, it ultimately comes down to data. The research team conducted a comprehensive evaluation on 9 mainstream benchmarks covering mathematics, science & medicine, code generation, search Q&A, and 4 collaboration modes (sequential reasoning, mixture-of-experts, knowledge distillation, negotiative tool usage). The open-source models used in the experiments were quite "luxurious"—Qwen, Llama-3, Gemma3, Mistral—assigned different roles to form various collaboration modes.

The baseline lineup was equally formidable: LoRA fine-tuning, full fine-tuning (SFT), Mixture-of-Agents, TextGrad, LoopLM, and Recursive-TextMAS, which uses the same recursive loop structure but forces text communication. This last control is especially crucial—it proves that RecursiveMAS's advantages indeed come from "skipping text decoding," not from the recursive structure itself. All comparisons were conducted under the same training budget, ensuring fairness.

RecursiveMAS Core Performance Metrics

The results show that RecursiveMAS achieved consistent improvements across all metrics:

Accuracy: Average accuracy increased by 8.3%; it outperformed TextGrad by 18.1% on the AIME2025 math competition and by 13% on AIME2026. Skipping text decoding not only did not lose information but allowed the model to retain richer latent semantics—after all, the information loss in the process of compressing thoughts into text and then decompressing is far greater than we imagine.

Speed: End-to-end inference speed increased by 1.2x to 2.4x, and continued to grow with increasing recursive rounds. This is significant for real-world application scenarios: in AI customer service or code assistance systems requiring real-time response, a 2x+ speed increase means a qualitative leap in user experience.

Cost: Compared to Recursive-TextMAS, token consumption decreased by 34.6% to 75.6%. This is not just cost savings; it means deeper reasoning can be attempted under the same token budget.

Inference Speedup Multiplier at Different Recursive Rounds

A key insight here: the greater the recursive depth, the higher the gain. The speedup effect grows with recursive rounds: average 1.2x at round 1, 1.9x at round 2, 2.4x at round 3. The reason is simple—what's saved is the time each Agent spends "writing thoughts into text"; the more Agents and rounds, the more time saved.
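One way to see why the ratio can grow with depth is a toy latency model. This is an illustration built on invented assumptions, not the paper's runtime analysis: `think` and `write` are hypothetical per-agent unit costs, the text-based system is assumed to pay both at every step, and the latent system is assumed to pay `write` only once at the end. The model does not reproduce the 1.2x, 1.9x, and 2.4x figures; it only shows the trend.

```python
def text_mas_latency(agents: int, rounds: int, think: float, write: float) -> float:
    """Every agent in every round both thinks and writes its thoughts out as text."""
    return agents * rounds * (think + write)

def latent_mas_latency(agents: int, rounds: int, think: float, write: float) -> float:
    """Agents only think; text is written once, by the last agent of the final round."""
    return agents * rounds * think + write

def speedup(agents: int, rounds: int, think: float, write: float) -> float:
    return (text_mas_latency(agents, rounds, think, write)
            / latent_mas_latency(agents, rounds, think, write))

# Hypothetical unit costs: writing text is taken to cost twice as much as thinking.
for r in (1, 2, 3):
    print(f"rounds={r}: speedup={speedup(agents=3, rounds=r, think=1.0, write=2.0):.2f}x")
```

In this model the single final decode is spread over more and more latent steps, so the ratio rises with each round and approaches `(think + write) / think`.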

Token Saving Ratio at Different Recursive Rounds

At the third recursive round, token consumption decreased by 75.6%—meaning that at equal performance, operating costs can be compressed to about one-quarter. For production environments requiring complex multi-step reasoning, this is undoubtedly a huge attraction.
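The "about one-quarter" figure follows directly from the reported reduction, under the simplifying assumption that operating cost is proportional to token count. The helper name `remaining_cost_fraction` is introduced here for illustration.

```python
def remaining_cost_fraction(token_reduction_pct: float) -> float:
    """Fraction of the original token cost that remains after a given percentage reduction."""
    return 1.0 - token_reduction_pct / 100.0

# The two ends of the range reported in the article.
for pct in (34.6, 75.6):
    frac = remaining_cost_fraction(pct)
    print(f"{pct}% fewer tokens -> {frac:.3f} of the original cost")
```

A 75.6% reduction leaves 24.4% of the original token spend, which is where the one-quarter estimate comes from; at the low end of the range, 34.6%, about 65% of the spend remains.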

Why is This Research Worth Attention?

If it were just numerical improvements, this paper might not have attracted such attention. What truly makes it noteworthy is its potential to redefine the Scaling direction of multi-agent systems.

Over the past few years, Scaling attempts in the multi-agent field have mainly revolved around three paths: increasing the number of agents, expanding context windows, and stacking larger models. But each of these methods faces its own bottleneck—more agents lead to communication explosion, larger windows lead to cost explosion, and larger models lead to training explosion.

RecursiveMAS offers a new path: deepening recursive depth. It transforms "multi-agent collaboration" from a parallel, text-interaction paradigm into a deep, latent-space recursive paradigm. Just as recursive language models deepen reasoning by repeatedly processing the same problem, RecursiveMAS allows multiple agents to repeatedly "deliberate" each other's "thoughts" without having to "speak and listen back" each time.

The core question posed by the researchers in the paper is: "Can agent collaboration itself be scaled through recursion?" The answer seems to be yes.

When the system no longer needs to "translate" internal representations into human-readable intermediate formats, the upper limit of collaboration efficiency can potentially be further unlocked.

The current industry backdrop also provides practical deployment scenarios for this research. Baidu themed its 2026 Developer Conference "Agents at Scale," Anthropic launched Claude Managed Agents, and OpenAI is advancing real-time GPT-5-level reasoning: the entire industry is looking for ways to move agent collaboration from demos to production. The three major hurdles on that path (computation cost, inference latency, and memory limits) are precisely what RecursiveMAS attempts to tackle with a 0.31% parameter overhead.

Of course, this research is still in its early stages, and several issues deserve attention:

Data credibility needs verification. The current results are self-reported by the authors; independent teams have not yet completed replication. The academic community's attitude towards new technology is often "bold hypotheses, careful verification." In this era of "paper explosion," independent replication is the best way to test a technology's true value.

Compatibility of heterogeneous agents. Although the Outer RecursiveLink is designed to connect models of different architectures, the paper does not detail the specifics of transferring latent representations across architectures. If it can only be used for homogeneous agents, its practical application scope will be greatly limited. After all, real-world scenarios often require mixing closed-source APIs like GPT-4o and Claude.

Decreased interpretability. When agents pass not readable text but a bunch of vector representations, the entire collaboration process becomes a "black box." In production environments where AI decisions need to be accountable, this opacity may pose compliance and auditing challenges.

Complexity of production environments. The paper tests relatively clean collaboration scenarios; real production environments often involve complex factors like external tool usage, human-computer interaction, and dynamic workflows.

The proposal of RecursiveMAS essentially introduces "recursion," a Scaling strategy proven effective in the single-model era, into the multi-agent era, challenging the default assumption that "agents must pass information through natural language." If the data is reproducible, the next-stage Scaling axis in the MAS field may shift from "stacking agent count" to "deepening recursive depth."

Certainly, this research still needs validation on more independent benchmarks, requires solving the issue of heterogeneous model interconnection, and needs to prove itself in real production environments. But at least, it shows us a possibility—

Collaboration between AI agents doesn't always have to be "like chickens talking to ducks."

(This article was first published on Titanium Media APP. Author: Silicon Valley Tech_news; Editor: Jiao Yan.)

Related Questions

Q: What is the core idea behind the RecursiveMAS system proposed in the research?

A: The core idea of RecursiveMAS is to eliminate the "language tax" in multi-agent AI systems. It enables AI agents to communicate directly in latent space using continuous vector representations ("thoughts") rather than generating and parsing natural language text at each interaction step, thereby bypassing the inefficiencies of textual encoding and decoding.

Q: How does RecursiveMAS achieve a reported 2.4x speedup in reasoning?

A: RecursiveMAS achieves its speedup by eliminating the time-consuming process of text generation and parsing for inter-agent communication. Agents pass latent representations (vector embeddings) directly via a RecursiveLink module. The speedup scales with recursion depth (1.2x at round 1, 1.9x at round 2, 2.4x at round 3) because it saves the latent-to-text and text-to-latent conversion overhead for each agent in every round.

Q: What are the key performance improvements (accuracy, speed, cost) reported for RecursiveMAS?

A: The reported improvements are: 1) Accuracy: average accuracy increased by 8.3%, with gains up to 18.1% on the AIME2025 benchmark. 2) Speed: end-to-end inference speed increased by 1.2x to 2.4x. 3) Cost: token consumption reduced by 34.6% to 75.6% compared to text-based communication methods.

Q: What is the main purpose and design of the RecursiveLink module in RecursiveMAS?

A: RecursiveLink is a lightweight two-layer residual module designed to preserve a model's last-layer hidden states and transfer them into another model's embedding space. It comes in inner (for intra-agent recursive thinking) and outer (for inter-agent latent communication) versions. It allows information to flow between agents without being converted to text, and only this module needs training, keeping the base model weights frozen.

Q: What are some potential limitations or challenges mentioned for the RecursiveMAS approach?

A: Potential limitations include: 1) The results are self-reported and await independent verification and replication. 2) Potential compatibility issues with heterogeneous agents (different model architectures), as details on cross-architecture latent transfer are not fully disclosed. 3) Reduced interpretability, as communication happens in latent vectors, making the collaborative process a "black box." 4) Unproven behavior in complex real-world production environments involving tool use and dynamic workflows.
