Where Is the AI Infrastructure Industry Chain Stuck?

marsbit | Published 2026-04-21 | Updated 2026-04-21

Article Summary

The AI infrastructure (AI Infra) industry chain is facing unprecedented systemic bottlenecks, despite the rapid emergence of applications like DeepSeek and Seedance 2.0. The surge in global computing demand has exposed critical constraints across multiple layers of the supply chain, from core manufacturing equipment and data center cabling to specialty materials and cleanroom facilities. Key challenges include four major "walls":

- **Memory Wall**: High-bandwidth memory (HBM) and DRAM face structural shortages as AI inference demand outpaces training, with new capacity not expected until 2027.
- **Bandwidth Wall**: Data transfer speeds lag behind computing power, causing multi-level bottlenecks in-chip, between chips, and across data centers.
- **Compute Wall**: Advanced chip manufacturing, reliant on EUV lithography and monopolized by ASML, remains the fundamental constraint, with supply chain fragility affecting production.
- **Power Wall**: While energy demand from data centers is rising, power supply is a solvable near-term challenge through diversified energy infrastructure.

Expansion is further hindered by shortages in testing equipment, IC substrates (critical for GPUs and seeing price hikes over 30%), specialty materials like low-CTE glass fiber, and high-end cleanroom facilities. Connection technologies are evolving, with copper cables resurging for short-range links due to cost and latency advantages, while optical solutions dominate long-range scenari...

As groundbreaking AI applications like DeepSeek and Seedance 2.0 continue to emerge, global demand for computing power is surging at an unprecedented pace. However, behind this computing arms race, the AI infrastructure (AI Infra) industry chain is facing systemic bottlenecks like never before. From core equipment in chip manufacturing to a single copper cable in data centers, from specialty materials to cleanroom facilities, nearly every critical link is flashing a "red light."

Four Major "Walls" in Computing Power Development

The development of AI computing power is not just about improving chip performance; it is a complex systems engineering challenge involving computing, storage, transmission, and energy.

(1) Memory Wall: The First Shackle in the AI Inference Era

Currently, the AI industry is shifting its focus from large model training to inference, with global AI inference demand expected to surpass training scenarios by 2026. The explosion in AI inference demand directly drives the need for high-bandwidth memory (HBM) and high-capacity DRAM.

Although major memory chip manufacturers are planning to expand production capacity, it takes at least two years from investment to actual production line operation, meaning the supply shortage is unlikely to ease in the short term. New capacity is primarily set to come online in 2027 and beyond, leading to a structural mismatch in 2026 where demand grows rapidly while supply lags.

(2) Bandwidth Wall: The "Clogged Capillaries" of Data Flow

The speed of computing power improvement far exceeds that of data transmission. This contradiction has led to a severe "bandwidth wall" problem—data flow within chips, between chips, within server racks, and between data centers has become the performance bottleneck of the entire computing system.

The current bandwidth bottleneck is multi-layered: within chips, interconnect delays and power consumption between transistors are continuously rising; between chips, traditional PCB board interconnects can no longer meet the high-bandwidth, low-latency demands of AI chips; within server racks, interconnect bandwidth between servers has become a constraint for Scale Up (vertical scaling); between data centers, long-distance transmission bandwidth and latency limit the efficiency of Scale Out (horizontal scaling) and cross-regional computing power scheduling.

Estimates show that in current AI training clusters, the energy consumption of data movement already exceeds that of computation itself. How to unclog the "capillaries" of data flow and reduce transmission latency and power consumption is a critical issue that must be addressed for AI Infra development.

(3) Compute Wall: High-End Chip Manufacturing as the Fundamental Constraint

AI chip performance iteration relies heavily on advanced process technologies, and the production capacity of these advanced processes is entirely constrained by upstream high-end manufacturing equipment, particularly EUV (Extreme Ultraviolet) lithography machines.

Currently, ASML is the only company in the world that can produce EUV lithography machines, and its capacity is extremely limited and subject to strict export controls. This directly results in a severe shortage of capacity for processes below 7nm, which cannot meet the explosive demand for AI chips. Even NVIDIA, the global leader in AI chips, has seen deliveries of high-end chips like the H100 and H200 constrained by TSMC's advanced process capacity, with lead times stretching to several months or even over a year.

More critically, chip manufacturing is a highly globalized industry chain; a break in any single link affects the entire production capacity. From raw materials like photoresists, target materials, and electronic special gases to key equipment like etching and deposition tools, there are varying degrees of monopoly and supply constraints. This makes high-end chip manufacturing capability the most challenging bottleneck to break through in the AI Infra industry chain.

(4) Power Wall: A Relatively Controllable Short-Term Challenge

Compared to the first three, the power wall is a relatively easy bottleneck to solve. AI data centers are major energy consumers; the annual electricity consumption of a single ultra-large data center campus can exceed that of a medium-sized city with hundreds of thousands of people. Currently, data centers account for 2% to 3% of total global electricity use, and the share is still climbing. But the power issue is essentially an infrastructure construction problem that can be addressed through diversified energy supply methods like gas turbines, fuel cells, and photovoltaics.

In the long run, with the development of renewable energy technologies and the improvement of energy infrastructure, power supply will not become the biggest mid-to-long-term bottleneck for AI computing power development. However, in some regions, short-term power supply pressures due to lagging grid construction may still limit the pace of data center construction.

The "Invisible Killer" of Capacity Expansion: Comprehensive Shortages in Equipment and Materials

The pace of AI chip capacity expansion is far slower than expected, with the core constraint not being the chips themselves but comprehensive shortages in upstream equipment and materials.

(1) Rapid Growth in Demand for Testing Equipment

AI chip technology upgrades are driving higher precision and efficiency requirements for testing equipment. Compared to ordinary logic chips, AI GPUs have a massive increase in signal ports, consuming more signal channel resources of testers; simultaneously, the surge in transistor count leads to a significant increase in corresponding test vector scale and per-chip testing time. More critically, while only a certain percentage of chips in traditional consumer electronics are tested, for AI chips, 100% of chips must be tested, often through multiple stages, to ensure the entire chipset operates normally. Driven strongly by AI computing demand and the memory market explosion, semiconductor test equipment (ATE) has become one of the fastest-growing categories in the semiconductor equipment sector.

Advantest, the world's largest chip test equipment supplier, also stated that it expects record highs for the fiscal year ending March 2026, with revenue projected to grow 37% and net profit more than doubling from the previous year.

(2) IC Substrates/Package Substrates: The "Choke Point" More Expensive Than Chips

Surprisingly, the biggest supply chain pain point for leading chip manufacturers like NVIDIA is not the chips themselves, but IC substrates (package substrates). IC substrates are key components connecting chips to PCB boards, providing electrical connection and physical support. AI chips have extremely high requirements for IC substrates—they need larger area, higher wiring density, better thermal performance, and lower signal loss. This also means their value is inevitably much higher than ordinary PCBs. Estimates show that IC substrates account for about 50% of the total packaging cost, and in advanced flip-chip packaging, this proportion can even reach 70%–80%. Depending on the resin material used, IC substrates are mainly divided into BT substrates and ABF substrates. BT substrates are primarily used for various memory chips, while ABF is more focused on logic chips like CPUs, GPUs, FPGAs, and ASICs.

According to incomplete statistics, IC substrate prices have risen by a cumulative 30% or more since 2025. The price hike has two main causes. First, cost transmission from upstream raw materials: core materials like high-end glass fiber cloth and copper foil have been in continuous short supply since 2025, with the capacity gap continuing to widen. Second, the explosion in demand for 2.5D/3D advanced packaging: high-end chips like GPUs commonly adopt multi-chip stacking architectures, and the significant increase in chip layers and area directly drives up the demand for substrate area.
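To make the cost pass-through concrete, here is a minimal sketch that combines the figures quoted above: a substrate share of about 50% of packaging cost (70% to 80% in advanced flip-chip packaging) and a cumulative substrate price increase of 30%. It assumes, purely for illustration, that all other packaging costs stay flat and that the substrate price increase is passed through in full. The function name `packaging_cost_increase` is hypothetical.

```python
def packaging_cost_increase(substrate_share: float, substrate_price_rise: float) -> float:
    """Fractional rise in total packaging cost when only the substrate price changes.

    Assumption (not from the article): every other packaging cost stays flat
    and the substrate price rise is passed through in full.
    """
    if not 0.0 <= substrate_share <= 1.0:
        raise ValueError("substrate_share must be between 0 and 1")
    return substrate_share * substrate_price_rise


if __name__ == "__main__":
    price_rise = 0.30  # article: cumulative increase of over 30% since 2025
    for share in (0.50, 0.70, 0.80):  # article: ~50% typical, 70%-80% for flip-chip
        rise = packaging_cost_increase(share, price_rise)
        print(f"substrate share {share:.0%}: packaging cost up about {rise:.0%}")
```

Under these assumptions, a 30% substrate price rise alone lifts total packaging cost by roughly 15% to 24%, depending on the substrate share. Since the article gives the price increase as "over 30%", these figures are best read as a lower bound.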

Unlike ordinary PCBs, IC substrates have high technical barriers and complex processes. Global capacity for high-end IC substrates is mainly concentrated in a few Taiwanese manufacturers like Unimicron and Nan Ya PCB, with capacity expansion cycles as long as 18-24 months. This means the tight supply situation for IC substrates is unlikely to be fundamentally alleviated within the next two years.

(3) Key Specialty Materials: The Extremely Scarce "Industrial MSG"

Some seemingly insignificant specialty materials are becoming the "Achilles' heel" of the AI industry chain. Materials like Low-CTE (low coefficient of thermal expansion) glass fiber, specialty copper foil, and high-end drill bits, though used in small quantities, are indispensable "industrial MSG" for manufacturing high-end IC substrates and PCB boards.

The high power consumption and performance requirements of AI chips necessitate the use of materials with extremely low thermal expansion coefficients for substrates and PCBs to prevent deformation under high-temperature operating conditions. At the same time, because fillers are used, the lifespan of drill bits used in processing drops drastically to 1/5-1/7 of the original, leading to explosive growth in demand for drill bits.
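The link between shorter drill bit life and higher drill bit demand can be sketched with simple arithmetic. The baseline figures below (total holes to drill and holes per bit) are hypothetical values chosen only to keep the numbers round; the 1/5 and 1/7 lifespan factors come from the paragraph above. The function name `drill_bits_needed` is also an assumption.

```python
def drill_bits_needed(total_holes: int, holes_per_bit: int) -> int:
    """Number of drill bits consumed to drill `total_holes` holes (rounded up)."""
    if holes_per_bit <= 0:
        raise ValueError("holes_per_bit must be positive")
    return (total_holes + holes_per_bit - 1) // holes_per_bit


if __name__ == "__main__":
    # Hypothetical baseline, chosen only so the arithmetic is easy to follow.
    total_holes = 2_100_000
    baseline_life = 2_100  # holes per bit on conventional material (assumed)

    baseline = drill_bits_needed(total_holes, baseline_life)
    for divisor in (5, 7):  # article: lifespan falls to 1/5-1/7 of the original
        reduced_life = baseline_life // divisor
        bits = drill_bits_needed(total_holes, reduced_life)
        print(f"life 1/{divisor}: {bits} bits, {bits / baseline:.0f}x baseline")
```

For the same drilling workload, bit consumption scales inversely with bit life, so a lifespan of 1/5 to 1/7 implies roughly 5 to 7 times as many bits even before any growth in board volume.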

These specialty materials have extremely high technical barriers, global capacity is highly concentrated, and expansion is difficult. Any supply interruption will directly impact the normal operation of the entire AI industry chain.

(4) High-End Cleanrooms: The Overlooked High-Barrier Segment

In the AI industry chain's capacity expansion, high-end cleanrooms are another severely overlooked high-barrier segment. Advanced process chips and advanced packaging have extremely high requirements for production environment cleanliness—a single speck of dust in the air can cause an entire wafer to be scrapped.

The construction of high-end cleanrooms requires not only huge capital investment but also extremely high technical expertise. From air purification systems to anti-static facilities, from temperature and humidity control to vibration isolation, every aspect has strict standards. Currently, the global high-end cleanroom market is mainly dominated by overseas companies, with net profit margins potentially exceeding 20%, far higher than domestic counterparts.

With the global expansion of AI chip capacity, demand for high-end cleanrooms remains strong, making it a segment with extremely strong certainty and high prosperity within the industry chain.

The "Route Dispute" in Connection Technology: Copper Resurgence and Photonic-Electronic Integration

Beyond computing and expansion bottlenecks, connection technology inside data centers is undergoing a profound transformation. The route dispute between copper and light, along with the technological upgrades of PCB/substrates, is reshaping the connectivity landscape of AI Infra.

(1) Scenario-Based Competition and Substitution Between Copper and Light

For a long time, optical modules have been considered the future direction for high-speed interconnection in data centers. But with the explosion of AI computing demand, copper cable technology is experiencing a "resurgence," with copper and light forming a relationship of complementarity and substitution in different scenarios.

Short Distance (≤7 meters): Copper cables (AEC, Active Electrical Cables), with advantages of low cost, high reliability, and low latency, are comprehensively replacing laser-based optical modules. In short-distance interconnection scenarios within servers and within server racks, copper cables offer significant cost-performance advantages.

Medium Distance (~30 meters): Micro LED optical cables have become a compromise solution. They combine the advantages of copper cables and optical modules, offering better reliability than laser optical modules and lower cost than traditional optical modules, suitable for medium-distance interconnection between racks.

Long Distance (Between Data Centers): Traditional pluggable optical modules and fiber optics remain mainstream. CPO (Co-Packaged Optics) technology is considered the future direction; it integrates the optical engine with the chip package, significantly increasing bandwidth and reducing power consumption. However, it still faces challenges like high cost and poor reliability, and widespread commercial use is still some time away.
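The three distance tiers above can be expressed as a simple lookup. This is only an illustrative sketch: the 7-meter boundary comes from the article, but the article gives the medium tier as roughly 30 meters, so treating 30 meters as a hard cutoff (and routing everything longer to pluggable optics) is an assumption. The function and constant names are hypothetical, and CPO is left out because the article describes it as a future direction rather than a current choice.

```python
SHORT_REACH_MAX_M = 7.0    # article: short distance is <= 7 meters
MEDIUM_REACH_MAX_M = 30.0  # article says "~30 meters"; hard cutoff is an assumption


def pick_interconnect(distance_m: float) -> str:
    """Map a link distance in meters to the technology tier the article describes."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    if distance_m <= SHORT_REACH_MAX_M:
        return "AEC copper cable"
    if distance_m <= MEDIUM_REACH_MAX_M:
        return "Micro LED optical cable"
    return "pluggable optical module over fiber"


if __name__ == "__main__":
    # Example distances are hypothetical: in-rack, rack-to-rack, campus-scale.
    for d in (2, 20, 5_000):
        print(f"{d} m -> {pick_interconnect(d)}")
```

In practice the choice would also depend on cost, power, and reliability targets, which this distance-only sketch ignores.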

It is worth noting that the procurement scale and performance specifications for optical fiber in AI data centers already differ by an order of magnitude from traditional telecom networks. To meet the low-latency, high-bandwidth interconnection needs of GPU clusters, demand for specialty optical fibers like G.657.A2 continues to rise, and more cutting-edge hollow-core fiber solutions have entered the deployment stage. Hollow-core fiber replaces the traditional glass core with air, significantly improving transmission: loss can be reduced from the conventional 0.14 dB/km to below 0.1 dB/km, and delay from 5 μs/km to 3.46 μs/km, while tolerating higher optical power.
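The per-kilometer figures above translate into link-level numbers as follows. This is a rough sketch using only the article's loss and delay values; the 100 km link length is a hypothetical example, the class and function names are assumptions, and the hollow-core loss of 0.10 dB/km is an upper bound because the article says "below 0.1 dB/km". Connector, splice, and equipment delays are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fiber:
    name: str
    delay_us_per_km: float
    loss_db_per_km: float


# Per-kilometer values as quoted in the article.
CONVENTIONAL = Fiber("conventional", delay_us_per_km=5.0, loss_db_per_km=0.14)
HOLLOW_CORE = Fiber("hollow-core", delay_us_per_km=3.46, loss_db_per_km=0.10)  # loss: upper bound


def link_figures(fiber: Fiber, length_km: float) -> tuple[float, float]:
    """Return (one-way delay in microseconds, total loss in dB) for a fiber span."""
    if length_km < 0:
        raise ValueError("length_km must be non-negative")
    return fiber.delay_us_per_km * length_km, fiber.loss_db_per_km * length_km


if __name__ == "__main__":
    length_km = 100.0  # hypothetical inter-data-center span
    for fiber in (CONVENTIONAL, HOLLOW_CORE):
        delay_us, loss_db = link_figures(fiber, length_km)
        print(f"{fiber.name}: {delay_us:.0f} us one-way, {loss_db:.1f} dB loss")
```

On a 100 km span, these figures give about 500 μs versus 346 μs of one-way delay, a reduction of roughly 31%, and 14 dB versus at most 10 dB of fiber loss.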

Currently, the number of participants in the hollow-core fiber market is expanding rapidly, but prices remain relatively stable, at about 30,000-40,000 RMB per kilometer, far higher than ordinary optical fiber.

(2) Technological Upgrade Pressure on PCB/Substrates

To meet the high-bandwidth demands of AI chips, PCB and substrate technologies are also continuously upgrading. Currently, PCB/substrates are moving towards n+m layer structures, glass substrates, and modified Semi-Additive Process (mSAP) technology.

The n+m structure increases the number of layers and wiring density, enhancing the substrate's bandwidth capability; glass substrates have a lower coefficient of thermal expansion and better high-frequency performance, representing an important future direction for high-end substrates; mSAP technology enables finer circuit wiring, meeting high-density interconnection demands.

These technological upgrades place entirely new demands on upstream equipment, materials, and manufacturing processes, and also bring new industrial opportunities and challenges.

Summary

The AI Infra industry chain is facing intertwined constraints from multiple bottlenecks. From the memory wall, bandwidth wall, compute wall, and power wall at the computing level, to expansion-level shortages in testing equipment, IC substrates, specialty materials, and cleanrooms, to the technological route dispute at the connection level, every link affects the large-scale deployment of AI computing power.

High-end chip manufacturing capability is the most fundamental constraint, determining the performance ceiling and production scale of AI chips. Testing equipment, high-end IC substrates, and key specialty materials are currently the segments with the strongest certainty and the most acute supply-demand imbalance. In the long run, AI Infra development will show two major trends: first, the technological evolution of copper cable resurgence and photonic-electronic integration, where different technological routes will coexist in their respective advantageous scenarios; second, the restructuring of the global industry chain and the acceleration of localization, where domestic companies are expected to achieve breakthroughs in some segments.

This article is from the WeChat public account "Semiconductor Industry Vertical and Horizontal" (ID: ICViews), author: Peng Cheng

Related Q&A

Q: What are the four major bottlenecks ("walls") mentioned in the article that are constraining AI infrastructure development?

A: The four major bottlenecks are: 1) The Memory Wall, caused by the shift to AI inference and the resulting shortage of HBM and DRAM. 2) The Bandwidth Wall, where data transfer speeds cannot keep up with computing power, creating a performance bottleneck. 3) The Compute Wall, where the manufacturing of high-end chips is fundamentally constrained by the limited supply of advanced equipment like EUV lithography machines. 4) The Power Wall, a relatively more solvable short-term challenge concerning the massive energy consumption of AI data centers.

Q: According to the article, what is a more immediate supply chain pain point for chipmakers like NVIDIA than the chips themselves?

A: The article states that the most immediate supply chain pain point for chipmakers like NVIDIA is not the chips themselves, but IC substrates (packaging substrates). These are the critical components that connect the chip to the PCB, and their production is constrained by high technical barriers and long expansion cycles of 18-24 months.

Q: How is the 'Bandwidth Wall' problem described in the context of AI clusters?

A: The 'Bandwidth Wall' is described as a multi-level performance bottleneck where the speed of data movement cannot keep up with the speed of computation. This occurs within chips (interconnect delays), between chips (traditional PCB interconnects are insufficient), inside server racks (limiting scale-up), and between data centers (limiting scale-out). It's noted that the energy consumed by moving data in an AI training cluster already exceeds the energy consumed by the computation itself.

Q: What two key factors are driving the price increases and shortages of IC substrates?

A: The two key factors driving IC substrate price increases and shortages are: 1) Cost transmission from upstream raw materials like high-end glass fiber cloth and copper foil, which have been in short supply. 2) The explosive demand from 2.5D/3D advanced packaging, where multi-chip stacking architectures used in GPUs significantly increase the required substrate area.

Q: In data center connectivity, what are the competing technological routes for different distance scenarios as outlined in the article?

A: The article outlines a scenario-based competition between copper and optical technologies: 1) Short distance (≤7m): Active Electrical Cables (AEC) are replacing optical modules due to lower cost and higher reliability. 2) Medium distance (~30m): Micro LED optical cables are a compromise solution. 3) Long distance (between data centers): Traditional pluggable optical modules and fiber optics remain the mainstream, with Co-Packaged Optics (CPO) seen as a future direction.
