Crossing the 'Memory Wall': The Wafer-Level Revolution and Computing Power Routes in the AI Inference Era

marsbitPublicado a 2026-06-05Actualizado a 2026-06-05

Resumen

In 2026, a historic shift occurred in AI as major cloud providers' inference spending surpassed training spending for the first time, signaling a move from "building large models" to "using large models." This shifts the core challenge from computing power to the "memory wall"—the bottleneck of data movement (model weights, activations, KV Cache) between external DRAM and processors, where energy and latency from data transfer far exceed computation itself. Companies like Nvidia face GPU idle time due to bandwidth limits. In contrast, Cerebras Systems adopts a radical "wafer-scale" approach with its Wafer-Scale Engine (WSE). Instead of cutting a silicon wafer into many chips, Cerebras uses almost the entire wafer as one massive chip (WSE-3). This design provides 44GB of on-chip SRAM, delivering memory bandwidth thousands of times higher than traditional HBM (e.g., 21 PB/s vs. Nvidia B200). For LLM inference, weights are streamed layer-by-layer from external MemoryX storage to the chip, avoiding HBM bottlenecks. This results in token generation speeds 1.5–5 times faster than Nvidia's B200 in some models and significant advantages in first-token latency and long-context tasks. Additionally, Cerebras's architecture offers much lower interconnect power consumption (0.15 pJ/bit vs. GPU's ~10 pJ/bit). However, Cerebras faces challenges: SRAM scaling has slowed with advanced nodes, limiting future capacity gains; the chip requires specialized liquid cooling and custom software sta...

In 2026, the global development of AI reached a landmark inflection point—the capital expenditure on inference by hyperscale cloud vendors historically exceeded that on training for the first time. The industry's anchor shifted from 'training large models' to 'using large models,' fundamentally flipping the structure of computing demand.

In the training era, the core challenge of computing power was 'double-precision floating-point and cluster scale'; entering the inference era, the core challenge became 'memory bandwidth and communication latency.'

The bottleneck for large model inference is no longer merely computation, but data movement—model weights, intermediate activations, and KV Cache need frequent interaction between off-chip DRAM (like HBM) and GPUs. The larger the model, the higher the energy consumption and latency of data transfer, ultimately far exceeding the energy consumption of computation itself, thus forming the memory wall.

NVIDIA GPUs have built a solid fortress with CUDA and NVLink, but still cannot avoid GPU idling caused by bandwidth bottlenecks.

A domestic large model company, Zhipu, conducted a simple experiment: In a 512-card inference cluster, keeping the GPUs, model, and code unchanged, but only upgrading the network bandwidth cap from 200GB/s to 400GB/s, inference throughput directly increased by 10%, and first-token output latency decreased by 19%—the principle is simple: widen the road, and the cars can run faster.

However, non-GPU architectures represented by Cerebras seem to be tearing an opening in this memory wall.

Size comparison between Cerebras WSE-3 chip and NVIDIA B200 GPU

The Essence of Cerebras: A Near-Memory Computer Based on SRAM

Cerebras Systems was founded in Silicon Valley by Andrew Feldman and others. The early founding team all came from SeaMicro, a low-power microserver company that was later acquired by AMD. Subsequently:

In 2015, the founding team established the 'wafer-scale computing' route.

In 2016, they completed registration and Series A financing, entering a stealth R&D phase.

In 2019, they released their first product, the WSE-1 chip and CS-1 system, based on TSMC's 16nm process.

In 2021, they released the second-generation product, based on TSMC's 7nm process.

In 2024, they released the third-generation product (WSE-3 / CS-3), based on TSMC's 5nm process. Both the chip and system are manufactured entirely in the USA, making it a genuinely pure US-made chip system.

CS-3 system configuration, containing 1 WSE-3 chip

Cerebras's Wafer-Scale Engine (WSE) architecture philosophy is simple, direct, yet hits the pain point: trade extreme physical expansion for extreme compression of data movement latency.

Ordinary chips slice a wafer into many small chips; NVIDIA GPUs follow this approach. Cerebras does the opposite: don't slice, directly turn nearly the entire wafer into one giant chip, called the Wafer-Scale Engine (WSE).

Traditional chips are formed by cutting a 300mm diameter wafer into hundreds of small chips; Cerebras chooses to keep the entire wafer intact, using it directly as the whole chip. The latest WSE-3 boasts 4 trillion transistors, 900,000 AI cores, each equipped with 48KB of local SRAM, giving the entire chip 44GB of on-chip SRAM, providing 21 PB/s of on-chip memory bandwidth and 214 Pb/s of fabric bandwidth—thousands of times the bandwidth of traditional HBM.

Cerebras WSE's memory bandwidth is 2625 times that of NVIDIA's B200 packaged chip, breaking the memory bandwidth bottleneck in large model inference scenarios.

In Cerebras's architecture, model weights are never stored on the SRAM but reside in off-chip storage (MemoryX) and are transferred layer by layer to the giant chip. The approach involves separating the storage of neural network model weights from the computing units.

All model weights are stored externally in the MemoryX memory extension module. The weights required for computing each layer of the network are transmitted layer by layer to the CS-3 system on demand. Weights are stored in the DRAM and flash memory of MemoryX and transmitted to the CS-3 system at full bandwidth rates. These weights are not stored in the CS-3 system—not even temporarily cached—CS-3 relies on the core's underlying dataflow mechanism to complete computations.

Leveraging its wafer-scale architecture, Cerebras demonstrates barrier-breaking advantages in LLM inference constrained by memory bandwidth. When generating tokens sequentially, weights are streamed layer by layer from off-chip MemoryX to CS-3. For different models, the token rate is 1.5 to 5 times that of NVIDIA's B200.

NVIDIA DGX B200 GPU versus Cerebras CS-3 chip, token rate comparison when running different large models

Its core advantage lies in: The 44GB of on-chip SRAM in CS-3 provides 21 PB/s of ultra-high bandwidth (2625 times that of B200) and 214 Pb/s of interconnect bandwidth, freeing weight streaming from HBM interface limitations. Therefore, it performs exceptionally well in TTFT (Time To First Token), long-context, and agent workload scenarios.

Although weights are external to MemoryX, loaded layer by layer on demand, and not cached on-chip, CS-3 relies on the core's dataflow mechanism to perform lossless full FP16 precision computations in SRAM; leveraging linear performance scaling, it also unleashes astonishing total throughput in multi-user concurrent inference.

Besides bandwidth, there is also a power advantage. Recently, in a speech, Sutong Liu, Chairman of InnoLight, mentioned that customers' requirement for optical modules is 1 pJ/bit, while the current level is 10 pJ/bit. In Cerebras chips, the interconnect power consumption is only 0.15 pJ/bit, whereas the current GPU interconnect power consumption is 10 pJ/bit.

Bandwidth and power consumption comparison between Cerebras interconnect and GPU interconnect architectures

Thus, if Cerebras's wafer-scale large-chip architecture becomes mainstream for AI inference or even training, it might significantly suppress and structurally alter the shipment volumes of traditional optical modules and CPO (Co-Packaged Optics). The core logic is: the high demand for optical modules and CPO essentially aims to solve the bandwidth bottleneck of 'chip-to-chip interconnect' and 'node-to-node interconnect' in GPU clusters; Cerebras's architecture precisely solves the problem by 'eliminating distributed interconnects.'

Counterintuitive: The 'True and False' Fatal Flaws of Wafer-Scale Large Chips

The core of chips always lies in Trade-Off. To achieve extreme on-chip SRAM bandwidth, Cerebras also brings some issues.

Low Yield?

Quite the opposite. The size of a single AI core is reduced to 0.05 square millimeters (1% the size of a single H100 compute core), resulting in higher yield. By routing on-chip, defective cores can be disabled and bypassed, improving defect tolerance by 100 times compared to traditional multi-core processors. The chip actually has 1 million AI cores, but considering yield, it is advertised as having 900,000 AI cores.

Only Good at Inference, Not at Training?

In the years following Cerebras's founding, training was the mainstream topic, so the company focused heavily on training. It's only after inference demand surged that people realized its advantages in inference were more pronounced.

In reality, simplified distributed computing also brings advantages like reduced code complexity and lower communication overhead.

Training a 175-billion-parameter model on 4000 GPUs typically requires about 20,000 lines of distributed training code.

Cerebras achieves equivalent training with 565 lines of code—the entire model can be placed on the wafer without dealing with data parallelism complexity.

SRAM Scaling is Dead; Core Advantage Faces a Physical Ceiling.

The third-generation product is based on TSMC's 5nm, and its SRAM capacity only increased by 10% compared to the second-generation product based on TSMC's 7nm. Beyond 5nm, SRAM cell area hardly shrinks with process node advancement.

This means Cerebras can no longer significantly increase its core advantage (SRAM capacity) by upgrading TSMC's process nodes (e.g., from 5nm to 3nm) as it did before.

Limited by wafer size, cooling capability, and manufacturing costs, on-chip storage resources like SRAM are difficult to scale linearly with computing cores, encountering a resource ratio bottleneck. This almost blocks its evolutionary path.

Technical specifications of Cerebras's three product generations

The Triple Purgatory: Cooling, Process, and Ecosystem.

The entire wafer concentrates heat, leading to high heat flux density, necessitating reliance on customized data centers and dedicated liquid cooling systems. Moreover, ecosystem compatibility means customers must adapt to its customized software stack, with weak compatibility with existing general-purpose programming frameworks like CUDA, leading to high software porting and adaptation costs.

Low Off-Chip Bandwidth Creates an Expansion 'Island'.

Due to the limitations of wafer-scale physical design, the number of I/O pins that can be led out from the edge of the WSE is extremely limited, resulting in an I/O bandwidth of only 150GB/s. Compared to NVIDIA NVLink's 1.8TB/s bi-directional bandwidth, it's like a snail. This makes it extremely difficult for WSE to scale out at high speeds. Although Cerebras's SwarmX interconnect performs decently in multi-system combinations, in the face of super-large models requiring high-speed multi-chip interconnection, the extremely low off-chip bandwidth becomes a structural physical shackle.

Route Competition: Big Tech In-House Development—How Much Window Does Cerebras Have Left?

Big tech companies have multiple parallel paths to address 'inference requiring higher bandwidth + lower latency,' not just wafer-scale. They are encircling and suppressing the technological dividends of startups like Cerebras through three concurrent approaches.

1 In-House ASIC Development

Google TPU v8 has already split into training-specific and inference-specific versions; AWS Trainium 4 is on the way; Microsoft Maia is already in use within Azure, built on TSMC's 3nm process, with native FP8/FP4 tensor cores, a redesigned memory system equipped with 216GB HBM3e, and 272MB on-chip SRAM; even Anthropic has begun evaluating in-house inference chip development.

This path is highly probable and will directly cause the TAM (Total Addressable Market) for 'third-party inference procurement' in 2028 to be compressed by 10% to 25%.

2 Process Generalization of the Standard Packaging Route

This is the most direct dimensional reduction attack on Cerebras.

TSMC's SoW (System-on-Wafer) is already widely open to customers, and CoWoS 9.5x interposer will launch in 2027.

What these two products do—stitching multiple dies at the wafer level—essentially makes Cerebras's physical process generic and accessible to all.

NVIDIA's Vera Rubin will enter this ecosystem in the second half of 2026.

Although Cerebras's own cross-reticle stitching is exclusive, the exclusivity window lasts at most 2 to 3 years. Beyond 2027-2028, its process barrier will be diluted by TSMC's advanced packaging.

3 Breakthrough of Optical Interconnect/Optical Computing

The interconnect and memory wall of electronic chips have reached their limits. Photonics' high bandwidth, low latency, and zero crosstalk are the ultimate solution.

Optical routes represented by Lumentum are rising. The biggest advantage of wafer-scale is on-chip computing, but models will inevitably grow larger, making high-speed interconnect beyond wafer scale a necessity.

With the maturity of CPO (Co-Packaged Optics) and Optical Interconnects, it's highly likely we will see optical I/O directly introduced into WSE wafers, breaking the shackles of electrical interconnect. NVIDIA might also acquire companies with specific architectural advantages like LPU (e.g., Groq), combine them with optical interconnects, and develop wafer-scale systems compatible with existing NV super-node software.

Sprinting on the Cliff: Cerebras's Business and Delivery

Cerebras is currently facing a cliff-edge sprint forced by massive orders.

Deals with leading clients like OpenAI are forcing Cerebras to transform from a chip company into a new type of cloud service provider. It no longer just sells hardware but needs to lock in and build massive data center power and facilities in the short term.

According to contract requirements, Cerebras needs to deliver 250MW of data center capacity annually from 2026 to 2028. However, wafer-scale systems have extremely high requirements for data center rooms and cannot be directly placed into traditional air-cooled IDCs. Currently, Cerebras's progress in preparing data center capacity is significantly behind the contract requirements.

From tape-out to factory construction, from power approval to cooling system deployment—this is a quagmire of heavy assets and long cycles.

Epilogue: Left or Right?

Returning to the initial proposition, as the inflection point for inference computing power has arrived, the core of computing architecture always lies in trade-offs.

There is no absolute right or wrong, only the relatively optimal solution for the most important workloads. And workloads are already changing.

Cerebras goes left, choosing extreme physical optimization, trading the entire wafer and massive SRAM for extreme low latency for a single task, which is unbeatable in scenarios extremely sensitive to first-token latency.

NVIDIA goes right, choosing to maintain generality, using HBM + NVLink + massive cluster throughput to handle ever-changing workloads, responding with constancy to change.

The winds are shifting, and the road ahead is uncertain. It is precisely this dual uncertainty of technology and business that breeds the possibility of disruption. In the torrent of computing power flowing towards AGI, it is still too early to draw conclusions—because of uncertainty, there is opportunity.

This article is from the WeChat public account "Garlic Particle Machine Research Institute," author: Pili Youxia (Thunderbolt Ranger)

Preguntas relacionadas

QWhat is the key structural shift in the global AI industry in 2026, as identified in the article, and what does it signify?

AIn 2026, the global AI industry reached a pivotal inflection point where for the first time, capital expenditure on inference by hyperscale cloud providers surpassed that on training. This marks a fundamental shift in the focus of the industry from 'forging large models' to 'using large models', fundamentally flipping the structure of computing demand.

QWhat is the 'memory wall' in the context of large model inference, and how does Cerebras' WSE-3 architecture attempt to overcome it?

AIn large model inference, the 'memory wall' refers to the bottleneck caused by the energy consumption and latency of frequently moving data (model weights, intermediate activations, KV Cache) between off-chip DRAM (like HBM) and the GPU, which eventually far exceeds the energy cost of computation itself. Cerebras' WSE-3 architecture attacks this by using an entire wafer as a single, massive chip, packing 44GB of on-chip SRAM. This provides 21 PB/s of on-chip memory bandwidth, which is 2625 times the bandwidth of Nvidia's B200 GPU, drastically reducing data movement latency and breaking the memory bandwidth bottleneck for inference.

QAccording to the article, what are the three main parallel paths that major tech companies are taking to compete with specialized solutions like Cerebras?

AMajor tech companies are pursuing three parallel paths to address the need for higher bandwidth and lower latency in inference, thereby challenging specialized players: 1) Developing their own ASIC chips (e.g., Google TPU v8, AWS Trainium 4, Microsoft Maia). 2) Adopting standardized packaging processes like TSMC's SoW (System-on-Wafer) and CoWoS, which essentially democratize wafer-scale integration techniques. 3) Exploring breakthroughs in optical interconnects/computing (e.g., CPO, Optical Interconnects) to overcome the limits of electrical interconnects.

QWhat are the main potential weaknesses or challenges of the Cerebras wafer-scale chip (WSE) architecture, as outlined in the article?

AThe article highlights several challenges for Cerebras' wafer-scale architecture: 1) A physical scaling limit for SRAM, as SRAM cell area barely shrinks with process nodes beyond 5nm, blocking a key path for increasing its core advantage. 2) Significant thermal management challenges requiring specialized liquid-cooled data centers. 3) A weak software ecosystem and compatibility with existing frameworks like CUDA, leading to high adaptation costs. 4) Very low off-chip I/O bandwidth (150GB/s) compared to alternatives like NVLink, making the system a potential 'island' and hindering high-speed scaling for very large models.

QWhat critical business challenge is Cerebras currently facing due to large customer contracts, according to the article?

AFacing massive orders from leading customers like OpenAI, Cerebras is being forced into a 'cliff-side sprint' to transition from a chip company to a new type of cloud service provider. Its contracts reportedly require the delivery of 250MW of data center capacity annually from 2026 to 2028. However, building specialized data centers for its wafer-scale systems, which require unique power, cooling (liquid), and facility approvals, is a heavy-asset, long-cycle process where Cerebras' progress is already significantly lagging behind the contractual requirements.

Lecturas Relacionadas

Anthropic's IPO Launch: Commercial Miracle or Valuation Bubble?

Anthropic has confidentially filed for an IPO, led by Morgan Stanley and Goldman Sachs, potentially going public by October. Following its latest $650 billion funding round, its pre-IPO valuation stands at $965 billion, with projections reaching up to $2 trillion at listing, which would make it the highest-valued private company ever. The article, written by Fu Sheng, addresses skepticism that this represents an AI bubble akin to the 2000 dot-com crash. It argues the current situation differs fundamentally. Unlike the internet bubble era, which relied on speculative narratives with little revenue, Anthropic's valuation is backed by unprecedented, measurable financial performance. Key data points include: * **Revenue Growth:** ARR skyrocketed from $10 billion in early 2025 to $470 billion by May 2026, targeting $100 billion by year-end—a growth curve unmatched in business history. * **Profitability:** It achieved operating profitability in Q2 2026 with an estimated $5.6 billion profit. * **Efficiency:** With ~3,000 employees and ~$470 billion ARR, its revenue per employee exceeds $10 million. Products like Claude Code, launched less than a year ago, already generate $25 billion in annualized revenue. * **Enterprise Adoption:** It boasts a strong enterprise client base, with 8 of the Fortune 10 and over 1,000 large firms spending over $1 million annually on Claude. The valuation is framed using a traditional SaaS model (e.g., a 10x Price-to-Sales multiple on $100 billion revenue). The author contends the core question for analysts has shifted from "How big could this be?" to "How much is it earning and will earn next quarter?" The discussion extends beyond Anthropic to a broader paradigm shift: the transition from a "carbon-based" to a "silicon-based" economy. Companies are increasingly prioritizing investment in compute and AI capabilities over human resources, as these directly scale productivity and competitive advantage. Anthropic's IPO is thus positioned not just as a corporate milestone, but as a price anchor for this new economic era.

链捕手Hace 50 min(s)

Anthropic's IPO Launch: Commercial Miracle or Valuation Bubble?

链捕手Hace 50 min(s)

Near Returns to the AI Stage: Transformation into a Public Chain Due to 'Payroll Difficulties,' Agent and Privacy Emerge as New Growth Narratives

NEAR Returns to AI Origins: From Payroll Struggles to Blockchain, Now Focusing on AI Agents and Privacy NEAR Protocol's journey began not with grand blockchain ambitions, but from a practical hurdle: its AI startup founders, including Transformer paper co-author Illia Polosukhin, couldn't efficiently pay international developers in 2017. This led them to pivot and build a high-performance, scalable blockchain. After years navigating various crypto narratives like sharding and cross-chain interoperability, NEAR is now leveraging its AI roots to re-enter the AI arena. A key driver is its "NEAR Intents" layer, which abstracts complex cross-chain transactions. Users simply state their goal (e.g., swap BTC for ETH), and a solver network finds the optimal route. This system has processed over $20B in cross-chain volume, generating significant fee revenue. A major growth area is private transactions via "Confidential Intents/Swaps," which hide trade details until settlement to protect against MEV and front-running. Remarkably, private swaps recently accounted for over 40% of NEAR's transaction volume, highlighting strong demand but also potential regulatory scrutiny. With its AI-founder pedigree, NEAR is positioning itself at the intersection of blockchain, AI agents, and privacy, aiming to become infrastructure for the emerging agent economy while navigating the challenges of its rapid adoption.

marsbitHace 3 hora(s)

Near Returns to the AI Stage: Transformation into a Public Chain Due to 'Payroll Difficulties,' Agent and Privacy Emerge as New Growth Narratives

marsbitHace 3 hora(s)

From Ethereum to AI's 'CROPS': What Exactly is This Set of 'Slow Variables' That Vitalik Repeatedly Emphasizes?

In recent discussions, Vitalik Buterin has frequently emphasized the concept of "CROPS," a framework defining core values for Ethereum's development. CROPS stands for Censorship Resistance, Capture Resistance, Open Source, Privacy, and Security. Initially outlined in the Ethereum Foundation's "EF Mandate," it represents a commitment to user sovereignty, ensuring that the network resists external control, remains open, protects privacy, and prioritizes security. The relevance of CROPS extends beyond Ethereum's foundational principles, becoming crucial in the context of AI integration. As AI agents begin handling wallet operations and automated transactions, the risk increases that users may cede control over their digital assets, privacy, and intentions to centralized AI service providers. A "CROPS AI" would therefore emphasize local execution where possible, privacy-preserving remote model calls (e.g., using zero-knowledge proofs), and transparent, verifiable processes to maintain user agency. Vitalik highlights a significant convergence between "CROPS Ethereum access layer" and "CROPS AI." Both address the same fundamental challenge: how users can access powerful services—be it blockchain data via RPCs or AI models—without exposing sensitive information or relinquishing ultimate control. This intersection points toward a future digital entry point that is more private, secure, and user-controlled. Ultimately, CROPS is not merely an abstract ideal but a practical guidepost. It steers development—from protocol resilience and wallet design to AI agent safety—towards a future where users retain self-sovereignty even as digital systems grow more complex and powerful. In an era of accelerating AI adoption, these "slow variables" of censorship resistance, openness, privacy, and security may define Ethereum's enduring value.

marsbitHace 3 hora(s)

From Ethereum to AI's 'CROPS': What Exactly is This Set of 'Slow Variables' That Vitalik Repeatedly Emphasizes?

marsbitHace 3 hora(s)

Trading

Spot
Futuros

Artículos destacados

Cómo comprar ERA

¡Bienvenido a HTX.com! Hemos hecho que comprar Caldera (ERA) sea simple y conveniente. Sigue nuestra guía paso a paso para iniciar tu viaje de criptos.Paso 1: crea tu cuenta HTXUtiliza tu correo electrónico o número de teléfono para registrarte y obtener una cuenta gratuita en HTX. Experimenta un proceso de registro sin complicaciones y desbloquea todas las funciones.Obtener mi cuentaPaso 2: ve a Comprar cripto y elige tu método de pagoTarjeta de crédito/débito: usa tu Visa o Mastercard para comprar Caldera (ERA) al instante.Saldo: utiliza fondos del saldo de tu cuenta HTX para tradear sin problemas.Terceros: hemos agregado métodos de pago populares como Google Pay y Apple Pay para mejorar la comodidad.P2P: tradear directamente con otros usuarios en HTX.Over-the-Counter (OTC): ofrecemos servicios personalizados y tipos de cambio competitivos para los traders.Paso 3: guarda tu Caldera (ERA)Después de comprar tu Caldera (ERA), guárdalo en tu cuenta HTX. Alternativamente, puedes enviarlo a otro lugar mediante transferencia blockchain o utilizarlo para tradear otras criptomonedas.Paso 4: tradear Caldera (ERA)Tradear fácilmente con Caldera (ERA) en HTX's mercado spot. Simplemente accede a tu cuenta, selecciona tu par de trading, ejecuta tus trades y monitorea en tiempo real. Ofrecemos una experiencia fácil de usar tanto para principiantes como para traders experimentados.

372 Vistas totalesPublicado en 2025.07.17Actualizado en 2026.06.02

Cómo comprar ERA

Discusiones

Bienvenido a la comunidad de HTX. Aquí puedes mantenerte informado sobre los últimos desarrollos de la plataforma y acceder a análisis profesionales del mercado. A continuación se presentan las opiniones de los usuarios sobre el precio de ERA (ERA).

活动图片