Xiaomi MiMo's 99% Price Cut is Not Marketing! Luo Fuli Posts on X to Refute Critics

marsbit · Published 2026-05-31 · Last updated 2026-05-31

Summary

The price of Xiaomi's MiMo-V2.5 series API has been permanently reduced by up to 99%, specifically for the "Input (Cache Hit)" cost, which covers users re-reading historical context in long conversations. MiMo's head, Luo Fuli, published a detailed technical blog to clarify that this drastic price cut stems from genuine engineering breakthroughs, not a marketing stunt or a simple price war. The core of the achievement lies in six key engineering optimizations. First, the model architecture adopts a Hybrid Sliding Window Attention (SWA), reducing the memory footprint (KVCache) to 1/7th of a traditional model. Second, a dual-pool memory management system actually utilizes these savings, allowing a single GPU to handle over 5 times more concurrent users. Third, an upgraded prefix caching mechanism achieves a cache hit rate of 93-95% for repeated reads, meaning most such requests bypass GPU computation entirely. Fourth, a self-developed distributed cache (GCache) utilizes idle SSD space on existing GPU servers, eliminating additional storage costs. Fifth, an intelligent scheduling system (LLM-Router) efficiently routes requests to maximize cache reuse and performance. Sixth, Multi-Token Prediction (MTP) accelerates the model's text generation ("output") side. Together, these systemic optimizations dramatically lower the real computational cost per request, enabling the 99% price reduction for cached inputs while reportedly maintaining positive gross margins. Luo Fuli's disclosu...

By Xiang Xianzhi

Luo Fuli posted on X to draw a line under the price cut controversy surrounding Xiaomi's MiMo.

On May 26th, the official Xiaomi MiMo account posted an announcement on X: The MiMo-V2.5 series APIs are now permanently reduced in price, with a maximum discount of 99%. All context lengths are now uniformly priced, and token packages have been upgraded by 5-8 times.

This announcement circulated within China's AI circle for a whole week. The initial industry reaction split into several camps. The largest camp called it "another round of price wars": over the past two years, Zhipu, DeepSeek, ByteDance's Doubao, and Alibaba's Tongyi have taken turns cutting prices, and everyone is competing fiercely.

Another camp took a more pessimistic view: Xiaomi had just announced a halving of profits this year, and now they're still burning 600 billion on AI, slashing API prices by 90%—typical "losing money to grab market share." Others saw it as a continuation of the DeepSeek effect—the latter dragged the entire industry's pricing baseline to the floor, and whoever doesn't follow will be left out.

In response, Luo Fuli, the head of MiMo, shared a 5000-word technical blog post last night, publicly laying out the engineering calculations behind the price cut.

"Look, this is genuine engineering capability, not a marketing tactic."

To understand what Luo Fuli is saying, one must first grasp what exactly this 99% discount applies to.

It is not a price cut for the entire model. The 99% discount specifically targets a pricing tier called Input (Cache Hit)—that is, the portion where "users repeatedly read historical context in long conversations." The discount for ordinary new input (No Cache Hit) is much smaller, and the discount for model output (Output) is the smallest.

If you think of the model as a coffee shop, this becomes easier to understand.

You order a half-sugar latte. The coffee shop has two ways to make it. It can grind the beans, measure the syrup, and pour the milk from scratch each time, paying for ingredients and labor each time. Or, if the shop knows you want the same half-sugar latte every day this week, it can make a large pot, store it in the fridge, and scoop out a cup on each visit. What MiMo has done is the latter: it changes the user's repeated reads from "calculate on the fly" to "retrieve on the fly," so the real cost of this part is close to zero, which is what makes a 99% discount possible.

To achieve "retrieve on the fly," the technical blog discusses six engineering feats, each indispensable. Let's break them down one by one.

Engineering One: Compress Model "Memory" to 1/7

When the model is conversing with you, it needs to calculate and store an "intermediate state" for each token for the next step. This is called KVCache—think of it as the model's "short-term memory notebook." With each sentence spoken, the model notes a summary of that sentence in its notebook, so next time it can just flip through the notes instead of listening to everything you've said from the beginning.

Traditional models use "Full Attention" at every layer—meaning each token looks at all tokens in the entire conversation segment, making the notebook thicker and thicker. MiMo-V2.5-Pro changed the architecture: Out of 70 layers, 60 layers only look at the most recent 128 tokens (SWA, Sliding Window Attention), leaving only 10 "archivist" layers to see everything.

The result is that the KVCache volume is directly compressed to 1/7 of Full Attention, with computational load also at 1/7.
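The 1/7 figure can be checked with simple arithmetic. The sketch below is only an illustration: it assumes every layer stores the same amount of KV data per token (the article does not say so), and the function name `kv_cache_ratio` is hypothetical. It counts KV entries for 10 Full Attention layers plus 60 SWA layers with a 128-token window and compares them with 70 Full Attention layers.

```python
def kv_cache_ratio(context_len: int,
                   full_layers: int = 10,
                   swa_layers: int = 60,
                   window: int = 128) -> float:
    """Hybrid KVCache size relative to an all-Full-Attention model of equal depth.

    Assumption (not from the article): each layer stores the same number of
    bytes per cached token, so counting entries is enough.
    """
    total_layers = full_layers + swa_layers
    # A Full Attention layer keeps one KV entry per token in the context.
    full_entries = full_layers * context_len
    # An SWA layer keeps at most `window` entries, however long the context is.
    swa_entries = swa_layers * min(context_len, window)
    baseline_entries = total_layers * context_len
    return (full_entries + swa_entries) / baseline_entries


for n in (128, 4_096, 65_536, 1_000_000):
    print(f"{n:>9} tokens -> ratio {kv_cache_ratio(n):.4f}")
```

Under these assumptions the ratio is 1.0 for a 128-token context and approaches 10/70 = 1/7 as the context grows, so the saving is a long-context effect.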

This is the first foundation for cost reduction. By analogy, every employee in a company was originally required to remember all meeting minutes, leaving everyone overloaded and inefficient. Under the new rule, 60 employees only keep track of the most recent discussion, and 10 archivists manage the full history. The total note-keeping burden falls to about 1/7, yet the company's overall memory capacity doesn't decrease, and efficiency increases 7-fold.

Engineering Two: Make the Space Saved by SWA Actually Usable

Architecturally compressing the notebook to 1/7 is the first step, but to turn "theoretical 1/7" into "actual 1/7," there is another hurdle.

Traditional KVCache systems allocate video memory (VRAM) uniformly to all layers based on "maximum possible usage." This means: even though the 60 SWA layers only need a small notebook, the system allocates based on the "archivist's large notebook" for all layers—the space saved by SWA is wasted, reserved but not used, equivalent to saving nothing.

The approach by Luo Fuli's team was to split the KVCache into two independent pools. The 10 Full Attention layers use the "large pool," allocated based on full length; the 60 SWA layers use the "small pool," allocated only based on a 128-token window.

By analogy, the company originally gave every employee a "filing cabinet capable of holding 100 years of documents," but 60 employees actually only need a "small cabinet holding one week's documents," leaving 99% of the space in those large cabinets empty. The new approach allocates cabinets based on actual needs. As a result, the same office can fit over 5 times more colleagues: the number of concurrent users served by the same GPU increases more than 5-fold.

This step seems simple, but without it, the advantage of the previous SWA architecture would be in vain.
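The effect of splitting the pools can be sketched with a toy capacity model. Everything here is an assumption for illustration: the function `max_concurrent_sequences` is hypothetical, the budget is counted in abstract per-layer token entries rather than bytes, and real systems have other overheads that this ignores.

```python
def max_concurrent_sequences(budget_entries: int,
                             max_len: int,
                             dual_pool: bool,
                             full_layers: int = 10,
                             swa_layers: int = 60,
                             window: int = 128) -> int:
    """How many max-length sequences fit into a fixed KV budget.

    The budget is measured in per-layer token entries (a simplification).
    """
    if dual_pool:
        # Large pool sized by full length, small pool sized by the window only.
        per_sequence = full_layers * max_len + swa_layers * min(max_len, window)
    else:
        # Uniform allocation: every layer reserves space for the full length.
        per_sequence = (full_layers + swa_layers) * max_len
    return budget_entries // per_sequence


# Hypothetical budget: exactly 8 sequences of 64K tokens under uniform allocation.
budget = 70 * 65_536 * 8
print(max_concurrent_sequences(budget, 65_536, dual_pool=False))
print(max_concurrent_sequences(budget, 65_536, dual_pool=True))
```

In this idealized model the same budget holds 8 sequences with uniform allocation and 55 with two pools. The article's "over 5 times" is a measured production figure, so it should not be expected to match this toy upper bound.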

Engineering Three: Ensure "Returning Users Re-reading" Can Actually Hit the Cache

With the notebook compressed to 1/7 and the space truly usable, the next step is to solve an old problem: prefix cache hit rate.

Many user conversations share the same beginning—the same system prompt, the same codebase, the same long document. The system stores the computed results, and the next time a match occurs, it reuses them directly. This mechanism is called prefix caching.

But a pitfall arises in SWA mode: two requests having the same tokens does not guarantee the KV is still there. The prefix might have been computed, but parts outside the SWA window might have been evicted long ago. If the system still follows the old rule of "same tokens equals a hit" for reuse, it might read invalid or overwritten data, causing the model's performance to crash.

Luo Fuli's team upgraded the rule to a "window-safe length": a request only counts as a hit for the portion of the prefix whose cached data is still fully intact, the portion "you can completely borrow."

By analogy, a library has 1 million books. You want to borrow the complete three-volume set of "The Three-Body Problem." The original system would tell you "the book is here," but when you go over, you find the shelf holds only the cover and the first volume, because the latter two have been borrowed. This "false hit" wastes your trip and forces you to borrow again. The new system's rule only guarantees the portion you can completely borrow: it gives you the first volume first, then fetches the other two for you.
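One way to read the "window-safe length" rule is sketched below. This is an interpretation, not the team's implementation: the function name, the set-based bookkeeping, and the assumption that Full Attention KV for the prefix is always retained are all hypothetical. The idea is that a prefix of length p is only reusable if the SWA layers still hold KV for the last `window` positions before p.

```python
from __future__ import annotations


def window_safe_length(matched: int, swa_cached: set[int], window: int = 128) -> int:
    """Largest prefix length p <= matched that can be safely reused.

    `matched` is how many leading tokens equal a previously seen request.
    `swa_cached` holds the token positions whose SWA-layer KV is still stored.
    A prefix of length p is safe only if positions [p - window, p) are all cached.
    """
    for p in range(matched, 0, -1):
        start = max(0, p - window)
        if all(pos in swa_cached for pos in range(start, p)):
            return p
    return 0  # nothing reusable: recompute from scratch


# Toy example with a 4-token window: positions 6 and 7 were evicted.
still_cached = {2, 3, 4, 5, 8, 9}
print(window_safe_length(matched=10, swa_cached=still_cached, window=4))
```

A naive "same tokens equals a hit" rule would report 10 reusable tokens here. The window-safe rule reports 6, because that is the longest prefix whose trailing window is fully present, and the remaining tokens are recomputed.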

It sounds stricter, as if the hit rate would drop. But the opposite is true: because SWA compresses KVCache volume to 1/7, the same storage space can hold several times more content, significantly increasing the real hit rate.

Luo Fuli's blog provides actual online test numbers: Under mainstream harness frameworks, server-side cache hit rate averages 93%, and for high-frequency, long-cycle users, it can reach over 95%.

Translating this number's meaning: 95% of "re-read" requests don't need GPU computation at all; they're fetched directly from the cache. This is the physical basis for the 99% discount.

Engineering Four: Store the "Cache" on the GPU Servers' Built-in SSDs

With the hit rate up, the next question is: where to store this cache.

Video memory (HBM memory on the GPU) is expensive and limited—an eight-card H100 machine has only 640GB VRAM, but the KVCache MiMo needs to store could be on the scale of tens of terabytes. Therefore, tiered storage is necessary: recently used data stays in VRAM (L1), slightly older data goes to CPU memory (L2), and cold data is stored in distributed cache (L3).

It's like managing money. Cash in your wallet is VRAM—immediately accessible but can't hold much. Bank account balance is CPU memory—takes 30 seconds to access but can hold a lot. Term deposits are L3 distributed cache—takes 2 minutes to access but is much cheaper.
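The tiering described above can be sketched as a small lookup structure. This is a toy model, not GCache: the class name, the capacities, and the least-recently-used demotion policy are assumptions chosen for illustration, and plain in-process dictionaries stand in for VRAM, CPU memory, and the distributed cache.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    """Toy three-tier cache: tier 0 = L1 (VRAM), 1 = L2 (CPU memory), 2 = L3."""

    def __init__(self, capacities: tuple[int, int, int] = (2, 4, 8)) -> None:
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self.tiers):
            return  # fell off the coldest tier: the entry is dropped
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[level]:
            # Demote the least recently used entry to the next, cheaper tier.
            old_key, old_value = tier.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:
            tier.pop(key, None)  # keep each key in at most one tier
        self._insert(0, key, value)

    def get(self, key: str) -> tuple[Optional[bytes], Optional[int]]:
        """Return (value, tier number 1..3), or (None, None) on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote hot data back to L1
                return value, level + 1
        return None, None


cache = TieredKVCache()
for name in ("a", "b", "c"):
    cache.put(name, name.encode())
print(cache.get("a"))    # "a" was demoted to L2 when "c" arrived
print(cache.get("a"))    # now served from L1
print(cache.get("zzz"))  # miss
```

The design choice worth noting is that a hit in a colder tier promotes the entry, so frequently re-read context drifts toward the fastest tier while cold data settles in the cheapest one.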

The industry's common practice is to build a separate storage cluster for L3, with dedicated machines, dedicated data centers, paying rent monthly.

Xiaomi's storage team did it differently. They developed their own distributed cache system called GCache, deployed directly on the SSDs built into the GPU machines—co-located on the same machines with training tasks and inference tasks.

Plain translation: Others rent a separate warehouse to store large amounts of data; Xiaomi found the GPU machine's garage is actually empty and just stored the data there. Monthly rent is saved.

The exact words from the technical blog are: "The additional storage cost is 0."

The impact of this is bigger than it seems. In the conventional accounting of an AI company's compute costs, storage is a fixed expense item: the larger your model and the more users you have, the larger the storage bill. The GCache approach eliminates this item entirely. Combined with SWA's smaller KVCache and the 93-95% hit rate, the Time-To-Live (TTL) for KVCache in L3 extends from minutes to hours or even days. The longer the TTL, the wider the window in which historical context can be hit, the higher the cache hit rate, and the more solid the foundation for that 99% discount.

Engineering Five: Route Cache-Hit Requests Through the Shortest Path

The cache can be stored, queried, and is cheap. The final step is: how to route the correct requests to the correct machines.

Xiaomi developed its own scheduling system called LLM-Router, which does three things:

First, Affinity Scheduling. Routes requests with the same prefix to the same machine, maximizing cache reuse.

Second, Length Bucketing. Separates short requests (0-64K), medium requests (64K-256K), and long requests (256K-1M) into different processing channels, preventing short requests from being delayed by long ones.

Third, TTFT Optimization. Within the queue waiting for inference, prioritizes scheduling requests with smaller real computation loads (i.e., those with high cache hits)—avoiding them being blocked by "brand-new input" requests that require heavy computation.

For example, in regular airport scheduling, all passengers flying to the same destination are gathered in the same lounge, sharing the baggage claim process—this is Affinity Scheduling. Those with carry-ons and those with 3 large checked bags go through two separate security lines, so the fast aren't slowed by the slow—this is Length Bucketing. During boarding, prioritize passengers with only carry-ons; they board quickly, allowing the plane to depart earlier—this is TTFT Optimization.
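The three scheduling ideas above can be sketched in a few functions. None of this is LLM-Router's actual code: the names are hypothetical, hashing the prefix is just one common way to get affinity, "K" is taken to mean 1024 tokens, and which bucket a request sitting exactly on a boundary falls into is an assumption.

```python
from __future__ import annotations

import hashlib
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    uncached_tokens: int                      # sort key: real prefill work left
    request_id: str = field(compare=False)
    prefix: str = field(compare=False)
    total_tokens: int = field(compare=False)


def affinity_worker(prefix: str, num_workers: int) -> int:
    """Affinity Scheduling: the same prefix always maps to the same machine."""
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def length_bucket(total_tokens: int) -> str:
    """Length Bucketing: short (0-64K), medium (64K-256K), long (256K-1M)."""
    if total_tokens <= 64 * 1024:
        return "short"
    if total_tokens <= 256 * 1024:
        return "medium"
    return "long"


def ttft_order(requests: list[Request]) -> list[str]:
    """TTFT Optimization: serve requests with the least uncached work first."""
    heap = list(requests)
    heapq.heapify(heap)
    return [heapq.heappop(heap).request_id for _ in range(len(heap))]


queue = [
    Request(90_000, "fresh-doc", "doc-B", 90_000),   # brand-new input
    Request(500, "follow-up", "doc-A", 120_000),     # almost fully cached
    Request(4_000, "edit", "doc-A", 60_000),
]
print([affinity_worker(r.prefix, 8) for r in queue])
print([length_bucket(r.total_tokens) for r in queue])
print(ttft_order(queue))
```

In the example, the two requests sharing prefix `doc-A` land on the same worker, and the nearly fully cached follow-up is served before the brand-new document even though it arrived later.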

This scheduling strategy, in practice, increased L2 cache hit rate by 25%, single-machine input throughput by 30%, and reduced P90 latency for long requests by 30%.

Translated: The same GPU can serve more users. The other half of the price cut logic lies here—effective output per unit of compute is higher, cost per user is lower.

Engineering Six: Make the Model "Type" Faster Too

The first five items optimized the "read" side—reducing the cost of users repeatedly reading historical context to near zero. The sixth item optimizes the "write" side—the process of the model generating the next token.

Traditional models generate only 1 token at a time. MiMo natively supports 3-layer MTP (Multi-Token Prediction), which predicts the next 3 tokens at once. When those predictions turn out to be correct, the model skips the intermediate computations it would otherwise spend generating them one by one.

Using an analogy, traditional typing is one character at a time—to type "今天天气" (today's weather), you press 4 keys. MTP is like having autocomplete guess your next 1-2 characters—if it guesses correctly, you don't need to press those two keys.
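The "guess, then keep what was right" idea can be sketched as follows. The article does not describe how MiMo checks its predictions, so this uses the generic draft-and-verify pattern as an assumption: `draft` is what the prediction heads guessed, and `target[i]` is the token the main model would emit at position i if every earlier guess had been accepted. The function name is hypothetical.

```python
from __future__ import annotations


def accept_draft(draft: list[str], target: list[str]) -> list[str]:
    """Tokens produced in one decoding step under draft-and-verify.

    Correct guesses are kept. At the first wrong guess, the main model's own
    token is used instead and the remaining guesses are discarded.
    """
    accepted: list[str] = []
    for guess, truth in zip(draft, target):
        if guess != truth:
            accepted.append(truth)  # fall back to the main model's token
            break
        accepted.append(guess)
    return accepted


# Two guesses right, third wrong: still 3 tokens from a single step.
print(accept_draft(["the", "weather", "was"], ["the", "weather", "is"]))
# First guess wrong: 1 token, the same as ordinary decoding.
print(accept_draft(["a", "b", "c"], ["the", "weather", "is"]))
```

Under this scheme a step never yields fewer tokens than ordinary decoding, which is why the speedup depends on how often the guesses are right.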

MiMo's MTP, tested in agentic scenarios: decoding the first 128 tokens is accelerated by 2.3x, and tokens 128-256 are accelerated by 1.5x.

The significance is this: the 99% discount specifically targets Input (Cache Hit), but when the model actually serves a user, input and output occur within the same request. If the output side is not made cheaper as well, the overall request cost is only halved. MTP reduces the output half too, closing the loop on the profit model behind the entire price cut.
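The "only halved" reasoning can be made concrete with a toy cost model. All numbers below are hypothetical and chosen only so that input and output start out costing the same; they are not MiMo's prices, and the function name is an assumption.

```python
def request_cost(cached_in: int, new_in: int, out: int,
                 price_hit: float, price_miss: float, price_out: float) -> float:
    """Cost of one request; prices are per million tokens (hypothetical units)."""
    return (cached_in * price_hit + new_in * price_miss + out * price_out) / 1_000_000


# Hypothetical request: 100K cached input tokens, no new input, 10K output tokens.
before = request_cost(100_000, 0, 10_000, price_hit=1.0, price_miss=1.0, price_out=10.0)
# Cut only the cache-hit price by 99% and leave output untouched.
after = request_cost(100_000, 0, 10_000, price_hit=0.01, price_miss=1.0, price_out=10.0)
print(f"before={before:.3f} after={after:.3f} ratio={after / before:.3f}")
```

With these made-up numbers the request cost drops to about 50.5% of the original: the input half nearly vanishes while the output half is unchanged, which is the gap a faster output path is meant to close.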

Stringing the six items together into a cost-reduction chain:

SWA architecture → KVCache 1/7 → Dual pools truly release capacity → Same GPU can handle 5+ times more concurrent users → Prefix cache hit rate 93-95% → 95% of requests require almost no computation → GCache zeroes storage cost → Scheduling prioritizes hit requests → MTP also saves on generation → GPU time per request drops by an order of magnitude → Unit cost drops 95%+ → Pricing cut 99%, gross margin remains positive.

If any link is missing, the chain breaks. The 99% price cut is not a marketing figure; it's the cumulative effect of six engineering pillars combined and validated in real online operations.

Looking back at the initial industry interpretations, each had some truth. The price wars among Chinese large model companies over these two years are real; Xiaomi's halved profits while still investing heavily in AI is real; DeepSeek dragging the industry's pricing baseline to the floor is also real.

But by publicly releasing this technical blog and unpacking the detailed technical specifics, Luo Fuli undoubtedly hopes to counter the price war narrative, letting "technical matters belong to technology, and marketing matters belong to marketing."

She wrote in the blog that the inference efficiency of the MiMo-V2.5 series models does not come from a single breakthrough in one area but is the result of multi-dimensional collaborative optimization. Hybrid SWA benefits both prefill and decode, but poorly optimized KVCache implementation can actually increase costs at various stages. To achieve this goal, the MiMo team systematically rebuilt KVCache management, hierarchical caching, and prefix cache trees, tackling the core issues of SWA KVCache, optimizing scheduling strategies and Prefill/Decode pipelines. After validation in real online scenarios, they finally translated its theoretical efficiency advantages into production environments. Only then did Hybrid SWA demonstrate its architectural advantage of both strength and efficiency in long-context reasoning. Combined with MoE configurations and various optimizations for multimodal inference, it significantly improved the performance of online inference services.

This is a systematic approach to AI engineering, and a cost-reduction method worthy of reference and learning for the industry.

A price war doesn't require writing a blog post; engineering delivery does.

Related Questions

Q: What was the main announcement from Xiaomi's MiMo official account on May 26th, and what was the highest price reduction?

A: On May 26th, Xiaomi's MiMo official account announced that the MiMo-V2.5 series API pricing was permanently reduced, with the highest reduction being 99%. All context lengths were unified in pricing, and token packages were upgraded by 5-8 times.

Q: What specific type of pricing did the 99% discount primarily target, and what does it represent in user interaction?

A: The 99% discount specifically targeted the "Input (Cache Hit)" pricing tier. This represents the cost for the part where users repeatedly read historical context in long conversations.

Q: According to Luo Fuli's technical blog, what was the key engineering achievement that reduced the KVCache volume to 1/7 of Full Attention?

A: The key engineering achievement was changing the model architecture in MiMo-V2.5-Pro. Out of 70 layers, 60 layers use Sliding Window Attention (SWA), focusing only on the most recent 128 tokens, while only 10 "archivist" layers view the entire context. This reduced the KVCache volume to 1/7 of Full Attention.

Q: What is GCache, and how did it help achieve cost savings according to the article?

A: GCache is a self-developed distributed caching system by Xiaomi's storage team. It is deployed directly on the SSDs that come with the GPU machines, co-locating with training and inference tasks. This approach eliminated the need for a dedicated storage cluster, resulting in "additional storage cost of 0."

Q: What are the three main functions of the LLM-Router scheduling system developed by Xiaomi?

A: The LLM-Router scheduling system performs three main functions: 1) Affinity scheduling: routing requests with the same prefix to the same machine to maximize cache reuse. 2) Length bucketing: separating short, medium, and long requests into different processing channels. 3) TTFT optimization: prioritizing requests with small real computational load (like those hitting cache heavily) in the inference queue.
