Written by: Bruce
Lately, the entire tech and investment communities have been fixated on the same thing: how AI applications are "killing" traditional SaaS. Ever since @AnthropicAI's Claude Cowork demonstrated how easily it can write your emails, build your PowerPoint decks, and analyze your Excel spreadsheets, a "software is dead" panic has been spreading. That is indeed frightening, but if your gaze stops here, you might be missing the real seismic shift.
It's like we're all looking up at a drone dogfight in the sky, while no one notices that the tectonic plates beneath our feet are quietly shifting. The real storm is hidden beneath the surface, in a corner most people can't see: the computing power foundation that supports the entire AI world is undergoing a "silent revolution."
And this revolution might end the grand party hosted by AI's shovel seller, NVIDIA @nvidia, much sooner than anyone imagined.
Two Revolutionary Paths Converging
This revolution isn't a single event, but rather the convergence of two seemingly independent technological paths. They are like two armies closing in, forming a pincer movement against NVIDIA's GPU hegemony.
The first path is the algorithm slimming revolution.
Have you ever wondered whether a super-brain really needs to mobilize every one of its cells to think through a problem? Obviously not. DeepSeek built this insight into its Mixture of Experts (MoE) architecture.
You can think of it like a company with hundreds of experts in different fields. But every time you need to solve a problem, you only call upon the two or three most relevant experts, rather than having everyone brainstorm together. This is the cleverness of MoE: it allows a massive model to activate only a small portion of its "experts" during each computation, drastically saving computing power.
What's the result? DeepSeek-V2 has 236 billion parameters in total, but activates only 21 billion of them for each token, less than 9% of the whole model. Yet its performance is comparable to that of GPT-4-class models running at full capacity. What does this mean? AI capability is decoupling from the computing power it consumes!
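To make the "only a few experts wake up" idea concrete, here is a minimal toy sketch of top-k MoE routing in Python with numpy. The layer sizes, the router, and the tanh stand-in experts are all my own illustrative assumptions; this is not DeepSeek's implementation, just the shape of the technique:

```python
# Toy Mixture-of-Experts routing: only the top-k experts run for each token.
# Sizes and the tanh "experts" are illustrative assumptions, not DeepSeek's design.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

router_w = rng.standard_normal((d_model, n_experts))            # routing weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """x: (d_model,) token vector. Only the top_k highest-scoring experts do any work."""
    scores = x @ router_w                              # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]               # indices of the top-k experts
    gate = np.exp(scores[chosen])
    gate /= gate.sum()                                 # softmax over the chosen experts only
    out = np.zeros_like(x)
    for g, idx in zip(gate, chosen):
        out += g * np.tanh(x @ experts[idx])           # each expert is a tiny stand-in MLP
    return out

y = moe_layer(rng.standard_normal(d_model))
print(f"experts activated per token: {top_k}/{n_experts} "
      f"(~{top_k / n_experts:.0%} of expert parameters touched)")
```

The point sits in the loop: per token, only `top_k` of the `n_experts` weight matrices are ever multiplied, so the compute bill scales with the activated slice (DeepSeek-V2's 21 billion) rather than the full parameter count (236 billion).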
In the past, we all assumed that the stronger the AI, the more GPUs it would need. Now DeepSeek has shown that clever algorithms can achieve the same effect at one-tenth the cost. That puts a huge question mark over just how indispensable NVIDIA's GPUs really are.
The second path is the hardware "lane change" revolution.
AI work splits into two phases: training and inference. Training is like going to school: you have to read countless books (data), and here GPUs, with their brute-force parallel computing, genuinely earn their keep. Inference is the everyday use of AI, where response speed matters far more.
GPUs have an inherent weakness in inference: their memory (HBM) sits off-chip, and shuttling data back and forth adds latency. It's like a chef whose ingredients are kept in a fridge in the next room; no matter how fast they run, every dish starts with a trip down the hall. Companies like Cerebras and Groq started from a clean sheet, designing dedicated inference chips that build the memory (SRAM) directly into the chip itself, putting the ingredients within arm's reach and bringing access latency close to zero.
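A quick back-of-envelope calculation shows why the location of the memory dominates single-stream inference speed. Every number here is my own illustrative assumption (8-bit weights, roughly 3 TB/s for off-chip HBM, a far larger aggregate figure for on-chip SRAM), not a vendor specification:

```python
# Why decoding one token is memory-bound: the chip must stream the active weights
# past the compute units once per generated token (single request, no batching).
# All bandwidth and size figures below are illustrative assumptions.
active_params = 21e9            # activated parameters per token (DeepSeek-V2's figure)
bytes_per_param = 1             # assume 8-bit weights
weight_bytes = active_params * bytes_per_param

hbm_bw = 3e12                   # ~3 TB/s, the rough class of off-chip HBM (assumption)
sram_bw = 100e12                # on-chip SRAM aggregate bandwidth, far higher (assumption)

t_hbm, t_sram = weight_bytes / hbm_bw, weight_bytes / sram_bw
print(f"HBM : ~{t_hbm * 1e3:.1f} ms per token  -> ceiling of ~{1 / t_hbm:.0f} tokens/s")
print(f"SRAM: ~{t_sram * 1e3:.2f} ms per token -> ceiling of ~{1 / t_sram:.0f} tokens/s")
```

Under those assumed numbers the off-chip design tops out at a hundred-odd tokens per second for a single stream, while the on-chip design's ceiling is dozens of times higher. The exact figures don't matter; the ratio is the argument.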
The market has voted with real money. OpenAI, even while complaining about NVIDIA GPUs' inference performance, turned around and signed a $10 billion deal with Cerebras specifically to rent its inference capacity. NVIDIA itself panicked, spending $20 billion to acquire Groq precisely so as not to fall behind in this new race.
When the Two Paths Converge: A Cost Avalanche
Now, let's put these two things together: run an algorithmically "slimmed-down" DeepSeek model on a "zero-latency" hardware platform like a Cerebras chip.
What happens?
A cost avalanche.
First, the slimmed-down model is small enough to be loaded entirely into the chip's on-board memory in one go. Second, with the external memory bottleneck gone, the AI's response speed becomes astonishingly fast. The net result: training costs drop by roughly 90% thanks to the MoE architecture, and inference costs drop by another order of magnitude thanks to specialized hardware and sparse computation. Put together, the cost of owning and operating a world-class AI could be just 10%-15% of the traditional GPU solution.
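For what it's worth, the arithmetic behind that range is one line long. The 90% training reduction and the order-of-magnitude inference reduction come from the paragraph above; the even split between training spend and lifetime inference spend is my own assumption:

```python
# Rough combined-cost arithmetic (normalized units; the 50/50 spend split is an assumption).
training_gpu, inference_gpu = 1.0, 1.0        # baseline spend on the traditional GPU stack
training_new = training_gpu * 0.10            # MoE: ~90% cheaper to train
inference_new = inference_gpu * 0.10          # sparse model on an inference ASIC: ~10x cheaper

ratio = (training_new + inference_new) / (training_gpu + inference_gpu)
print(f"combined cost vs. GPU baseline: {ratio:.0%}")   # ~10%, the low end of the 10-15% range
```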
This isn't an improvement; it's a paradigm shift.
The Rug Is Being Pulled Out from Under NVIDIA's Throne
Now you should understand why this is more fatal than the "Cowork panic."
NVIDIA's multi-trillion-dollar valuation today is built on a simple story: AI is the future, and the future of AI depends on my GPUs. But now the foundation of that story is being shaken.
In the training market, even if NVIDIA keeps its monopoly, customers who can get the same work done with one-tenth the GPUs mean the overall market could shrink significantly.
In the inference market, a pie ten times the size of training's, NVIDIA not only lacks an absolute advantage but is facing a siege from players like Google and Cerebras. Even its biggest customer, OpenAI, is defecting.
Once Wall Street realizes that NVIDIA's "shovels" are no longer the only, or even the best, option, what will happen to the valuation built on the expectation of "permanent monopoly"? I think we all know.
Therefore, the biggest black swan of the next six months may not be which AI application has killed which piece of software, but a seemingly insignificant piece of tech news: a new paper on MoE efficiency, say, or a report showing dedicated inference chips taking a meaningful share of the market, quietly announcing that the computing-power war has entered a new phase.
When the "shovel seller's" shovels are no longer the only choice, his golden age may well be over.