When Inference Becomes a Scarce Resource, Who Captures the Value?

链捕手 | Published 2026-06-08 | Updated 2026-06-08

Introduction

The core AI bottleneck has shifted from model training to inference (runtime execution). While concerns persisted about an "AI compute gap" (initially a $200B problem, now a $600B one), the market is now recognizing that the solution and the value lie in the inference layer. Nvidia's financial restructuring around "serving tokens" and Cerebras's successful IPO highlight this shift. Inference is a recurring, usage-based cost, estimated to be 10-50x larger than the one-time training market, especially with the rise of agentic AI. The inference stack spans six layers: silicon (e.g., Nvidia), bare metal (e.g., CoreWeave), GPU rental/aggregation, deployment/optimization, model APIs, and end applications. Most companies operate in one layer. However, Hyperbolic uniquely spans three layers (GPU rental, deployment, and model APIs) without owning any hardware. It aggregates fragmented GPU supply from multiple cloud providers into a standardized pool, offering developers the cheapest available compute through intelligent routing. Its multi-cloud aggregation creates a data moat and a flywheel: more supply leads to better pricing data and liquidity, attracting more developers and providers. In contrast, applications like Venice operate at the top of the stack, reselling privacy-wrapped inference but remaining dependent on and constrained by the underlying compute costs they purchase. As inference demand explodes, value accrues n...

Author: Frank Fu, IOSG

The gap David Cahn identified in 2023 has never been filled on the training side. It was filled on the inference side, and the market has only started pricing it in over the past few weeks. Once Nvidia restructured its financial reporting around "served tokens" and Cerebras's IPO was 20 times oversubscribed, the debate over the bottleneck was settled. The real question becomes the next one: when inference becomes a scarce resource, where in the compute stack will the value accrue?

I. Following the GPU: From a $200 Billion Problem to a $600 Billion Problem

In 2023, Sequoia's David Cahn raised the question hanging over the entire AI buildout: the "$200 Billion Problem." For every $1 spent on a GPU, roughly another $1 is needed to power it in a data center, meaning each year's GPU CapEx implies these chips must ultimately generate about $200 billion in revenue to recoup that capital. Even under very generous assumptions for AI revenue, he found a gap exceeding $125 billion between "investment" and "what end customers actually pay." The concern was straightforward: GPUs were being overbuilt ahead of real demand.
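The arithmetic behind that figure can be sketched as follows. The $1-for-$1 data center multiplier, the $200 billion total, and the $125 billion gap come from the paragraph above. The $50 billion GPU run-rate and the second 2x factor (GPU buyers needing roughly a 50% margin on what they sell) are assumptions about the shape of Cahn's argument and are not stated in this article.

```python
# Back-of-envelope version of the "$200 Billion Problem".
# All figures are in billions of USD.

GPU_RUN_RATE = 50.0         # ASSUMPTION: annual GPU spend, not stated in this article
DATA_CENTER_MULTIPLIER = 2  # from the text: each $1 of GPU needs about $1 more to run it
MARGIN_MULTIPLIER = 2       # ASSUMPTION: GPU buyers need ~50% margin on what they sell


def required_revenue(gpu_spend: float) -> float:
    """Revenue end customers must pay for the GPU spend to be recouped."""
    return gpu_spend * DATA_CENTER_MULTIPLIER * MARGIN_MULTIPLIER


def implied_actual_revenue(required: float, gap: float) -> float:
    """AI revenue implied by a stated gap between required and actual revenue."""
    return required - gap


required = required_revenue(GPU_RUN_RATE)
actual = implied_actual_revenue(required, gap=125.0)
print(f"required: ${required:.0f}B, implied actual AI revenue: ${actual:.0f}B")
```

Under these assumptions, a gap "exceeding $125 billion" corresponds to end-customer AI revenue of at most about $75 billion against a $200 billion requirement.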

A year later, the gap hasn't narrowed; it has widened. In Cahn's 2024 follow-up, as hyperscaler CapEx ballooned, he redefined it as the "$600 Billion Problem." The bearish logic converged into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.

Both articles essentially ask the same thing: who will fill this gap? The answer never appeared on the "training" side of the ledger. It appeared on the "inference" side, and the market has only started pricing it in over these past few weeks.

II. The Cerebras IPO and the Inference Squeeze

Cerebras went public on Thursday. The IPO was 20 times oversubscribed, priced at nearly twice the final uptick from Wednesday. The demand did not come from a bet on the "next Nvidia killer"; it came from something simpler: the market began to realize that the real bottleneck in AI is inference, not training.

Cerebras's core competency is a chip architecture that makes inference extremely fast. Not training, but inference. This is precisely what excites Wall Street. The inference market is recurring, expanding with usage. Every time Claude answers a question, every time an agent performs a task, compute is consumed. Training happens once; inference never stops.

J.P. Morgan estimates the inference market to be 10 to 50 times the size of training. When machines start executing tasks assigned by other machines, i.e., agentic expansion, inference demand no longer scales with the number of users, but with compute itself.

III. Nvidia Redraws the Map: Inference Makes Headlines

If Cerebras represents the market's awakening, Nvidia's latest earnings report is confirmation from the top of the supply chain. On the latest earnings call, Jensen Huang made the unspoken explicit: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has transitioned from one-off inference, to logical reasoning, and into the agent stage that calls tools and orchestrates tasks itself. Huang said, "Tokens are now profitable." In the AI era, compute equals revenue and profit.

This reshapes the entire industry. Training is a one-time cost to build a model; inference is the recurring cost to run it, and today's bottleneck is in inference, not training.

Nvidia wrote this assessment into its financial reporting format. It now reports by two platforms, not one: Data Center and Edge Computing. Data Center (about $75 billion for the quarter, +92% YoY) is further split into Hyperscale (about $38 billion, +12% QoQ) and ACIE, i.e., AI Cloud, Industrial & Enterprise (about $37 billion, +31% QoQ). The entirely new line is Edge Computing: $6.4 billion, +29% YoY, covering the endpoints where agentic AI and physical AI actually run, such as PCs, workstations, AI-RAN base stations, robots, and cars.

Edge still accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside Data Center. The signal is that inference is splitting into two fronts: cloud inference in data centers, and endpoint inference at the edge, where AI sees, moves, and acts in the physical world. The roadmap follows the same logic: Vera Rubin, which starts shipping in Q3, offers up to 35 times the inference throughput of Blackwell; Huang also gave a new $200 billion TAM for the Vera CPU designed for agentic workloads. Every frontier model company is expected to fully transition to it on day one.
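The segment figures quoted above can be checked against each other with a few lines of arithmetic. This is only a consistency check on the rounded numbers in the text; it assumes total revenue is the sum of the two reported platforms.

```python
# Quarterly segment figures quoted above, in billions of USD (all approximate).
segments = {
    "hyperscale": 38.0,
    "acie": 37.0,  # AI Cloud, Industrial & Enterprise
    "edge": 6.4,
}

# Data Center is reported as Hyperscale plus ACIE.
data_center = segments["hyperscale"] + segments["acie"]

# ASSUMPTION: total revenue is taken as Data Center plus Edge Computing.
total = data_center + segments["edge"]
edge_share = segments["edge"] / total

print(f"data center: ${data_center:.0f}B")
print(f"edge share of total: {edge_share:.1%}")
```

The two Data Center sub-segments sum to the quoted $75 billion, and Edge comes out just under 8% of the combined total, consistent with the text.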

When the world's highest-valued company restructures its financial disclosures around "served tokens," the debate over the bottleneck is settled. The remainder of this article discusses who captures the value when inference (not training) becomes the scarce resource.

First, a scope note. Of these two fronts, this article discusses cloud inference, i.e., API token services provided from rented data center GPUs. Endpoint inference runs on local chips inside the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN), completely bypassing the underlying GPU rental and aggregation stack. For our purposes, treat it as a tailwind that amplifies the overall inference economy and supports the bottleneck thesis, not as the market where Hyperbolic and Venice operate; both are entirely on the cloud front.

IV. The Squeeze is Here

Anthropic is the canary in the coal mine. Usage far exceeded provisioned capacity, and complaints that Claude had been "lobotomized" flooded the internet: rate-limited replies, slower inference, and compressed context windows. The solution was raw compute: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with 220k+ Nvidia GPUs and 300+ megawatts, and dedicated it to inference, not training.

This capacity unlocked a series of quota changes, each a signal. On May 6, Anthropic doubled Claude Code's five-hour limit, removed peak-hour throttling, and significantly increased Opus API rate limits. On May 13, it raised Claude Code's weekly limit by another 50% (through July 13). Then, starting June 15, it did the opposite of being "generous": it carved out agentic and programmatic usage (Agent SDK, headless mode claude -p, CI pipelines) from flat subscriptions into a separate metered credit pool ($20 to $200 per month, priced at API rates). This final step condensed the entire thesis into one action: agents consume inference faster than flat subscriptions were designed to handle, so they must be priced at their true "recurring cost."
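Why agentic usage breaks a flat subscription can be illustrated with a small sketch. The $200 flat price is the top of the range mentioned above; the per-token rate and both token volumes are hypothetical values chosen only to show the shape of the problem, and API-rate value is used as a stand-in for what the usage is worth.

```python
# All token volumes and the API rate are HYPOTHETICAL illustrations.
FLAT_PRICE = 200.0        # USD per month, top of the $20-$200 range in the text
API_RATE_PER_MTOK = 10.0  # ASSUMPTION: blended USD per million tokens


def metered_value(tokens_per_month: int) -> float:
    """What the usage would cost if billed at API rates."""
    return tokens_per_month / 1_000_000 * API_RATE_PER_MTOK


def flat_plan_surplus(tokens_per_month: int) -> float:
    """Flat subscription price minus the API-rate value of the usage."""
    return FLAT_PRICE - metered_value(tokens_per_month)


workloads = {
    "interactive chat": 5_000_000,     # ASSUMPTION: a human typing prompts
    "headless CI agent": 100_000_000,  # ASSUMPTION: an agent looping unattended
}

for name, tokens in workloads.items():
    print(f"{name}: API-rate value ${metered_value(tokens):,.0f}, "
          f"flat-plan surplus ${flat_plan_surplus(tokens):,.0f}")
```

With these made-up volumes the interactive user leaves a surplus under the flat plan while the agent consumes several times the subscription price, which is the pattern a separate metered pool is meant to address.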

Training is a one-time capital expenditure. Inference is a recurring operating cost, compounding with every new user, every new agent.

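V. The Stack: Six Layers, One Bottleneck

Every AI application sits on a supply chain that starts at the TSMC fab and ends at the API endpoint. Its six layers are silicon, bare metal, GPU rental and aggregation, deployment and optimization, model APIs, and end applications.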

Most companies own only one layer. Nvidia owns silicon, CoreWeave owns bare metal, Together AI owns inference optimization, OpenRouter owns model API routing.

Only one company is the exception.

VI. Hyperbolic: The Only Company Spanning Three Layers

Hyperbolic launched its on-demand GPU marketplace in June 2025. Within its first few months, it surpassed 200k developers, with adopters spanning frontier AI labs, search, and major consumer platforms.

What's interesting is its architecture.

Hyperbolic doesn't own a single GPU itself. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This sounds like a weakness but is actually a moat.

By sitting between GPU suppliers and consumers, Hyperbolic sees real-time data others don't. It knows who is buying what GPU, at what price, at what time. It sees oversupply before it becomes public, sees demand surges before they hit the market.

Today, the moat is this multi-cloud aggregation itself. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized, unified pool, allowing developers to rent the cheapest available GPU anywhere without negotiating with each operator or managing a pile of accounts. The more clouds it connects, the deeper the liquidity, the richer the pricing data. Further out, the team is exploring using this data to model GPU price curves and eventually deploying its own capital to smooth supply and demand, acting as a market maker for physical compute; but this goal remains early. What's compounding today is the aggregation layer.

This is the flywheel:

  1. Connect more clouds → More aggregated supply

  2. More supply → Deeper market & real-time pricing data

  3. Better data → Smarter routing today, pricing models long-term

  4. Better liquidity & price → More developers → More clouds want to connect
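The routing step in this flywheel can be sketched as picking the cheapest offer that still has enough capacity. This is a minimal illustration, not Hyperbolic's actual implementation: the `Offer` type, the `route` function, the provider labels, and every price and capacity below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str            # hypothetical provider label
    gpu_model: str
    usd_per_gpu_hour: float
    gpus_available: int


def route(offers: list[Offer], gpu_model: str, gpus_needed: int) -> Optional[Offer]:
    """Return the cheapest offer that can satisfy the request, or None."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.usd_per_gpu_hour, default=None)


# HYPOTHETICAL quotes; real prices and capacities are not given in the article.
pool = [
    Offer("cloud-a", "H100", 2.40, 8),
    Offer("cloud-b", "H100", 1.95, 4),
    Offer("cloud-c", "H100", 2.10, 64),
    Offer("cloud-c", "A100", 1.10, 32),
]

best = route(pool, "H100", gpus_needed=8)
print(best)
```

Here the lowest-priced H100 offer is skipped because it has only four cards, so an eight-GPU request lands on the next cheapest pool with enough capacity. Each additional connected cloud adds rows to `pool`, which is the sense in which more supply improves both price and availability.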

No other company is attempting this. Hyperbolic is the only company spanning the GPU rental layer, deployment layer, and model API layer simultaneously.

VII. The Mirror of Venice

Venice is the clearest manifestation of the inference economy at the application layer and a useful contrast to Hyperbolic's position. It is a privacy-first inference application: an OpenAI-compatible API, plus consumer-facing subscriptions (Free / Pro / Pro+ / Max), routing requests to about 75 models, roughly two-thirds of which are open-source or self-hosted (Llama, Mistral, Qwen, DeepSeek), with the rest being anonymized passthroughs to closed-source frontier models. The key point is, Venice does not own significant compute itself. It rents from undisclosed GPU partners and confidential computing suppliers (NEAR AI Cloud, Phala) and pays frontier labs for passthroughs, so its true cost of revenue is inference compute, not SaaS hosting.

What Venice actually sells is privacy. The "privatization" here isn't turning public compute into private property, but wrapping commoditized inference with a guarantee: no data retention, no training on data, anonymized requests, with some workloads running in TEEs, making the plaintext invisible even to the operators. The underlying compute is a commodity; the markup sells this privacy wrapper. Moreover, this guarantee is layered rather than uniform: for open-source models running on GPUs it controls or inside TEEs, it can achieve near end-to-end confidential computing; but for anonymized passthroughs to closed-source models like Claude or GPT, privacy means only stripping identity, and the frontier lab still processes your raw prompt. So the strongest privacy covers only the open-source portion; the frontier model portion is "anonymous," not "truly confidential." Venice's gross margin = subscription price − inference cost paid upstream, and the premium it can charge over the bare API price leans almost entirely on this privacy markup, which is also why it has thin margins and is constrained by frontier passthrough pricing.
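The margin relationship stated above can be written out directly. The subscription price and both per-subscriber inference costs below are hypothetical; the article does not disclose Venice's actual prices or costs.

```python
def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Per-subscriber monthly gross margin in USD."""
    return subscription_price - inference_cost


def gross_margin_pct(subscription_price: float, inference_cost: float) -> float:
    """Gross margin as a fraction of the subscription price."""
    return gross_margin(subscription_price, inference_cost) / subscription_price


PRICE = 18.0  # HYPOTHETICAL monthly subscription price in USD

# HYPOTHETICAL monthly inference cost per subscriber for two usage mixes.
scenarios = [
    ("mostly open-source models on rented GPUs", 9.0),
    ("mostly frontier passthrough", 15.0),
]

for label, cost in scenarios:
    print(f"{label}: ${gross_margin(PRICE, cost):.2f} "
          f"({gross_margin_pct(PRICE, cost):.0%})")
```

The point of the sketch is the sensitivity: with the price fixed, any increase in the cost of passthrough or rented compute comes straight out of the margin.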

The token design packages this inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM, an inference credit, with each DIEM roughly equivalent to $1 of compute per day. Paid subscriptions trigger a programmatic buyback and burn of VVV (about $2 / $5 / $10 for Pro / Pro+ / Max respectively), while emissions follow a fixed decreasing schedule: 6M → 5M → 4M VVV per month, dropping to 3M on July 1st. The buybacks are real but discretionary and still modest: about $103k burned in each of April and May, slowly climbing toward about $110k in June, far below the $200k per month line.

The fundamentals are healthier than the headlines. The widely circulated "$70 million ARR" figure almost certainly mistakes subscription renewals for net new customer acquisition; a more defensible observable range is closer to $6 to $15 million ARR. Underneath, traction is real: about 136k token-holding addresses, about 9.9 million website visits per month (about 330k daily), with new Pro subscriptions hovering around 1,400 per day. This is a real business, but a thin-margin one, whose economics are constrained by the compute it buys.

This is precisely why Hyperbolic sits one layer upstream of it. If Venice is the gas station, Hyperbolic is the refinery. Venice buys compute from the same constrained supply everyone depends on; Hyperbolic aggregates and standardizes that fragmented supply and sells it to Venice and all players like it. As inference demand grows, value accrues not only to applications consuming compute, but even more to the layer that aggregates and routes that compute, capturing the cost of revenue paid by those applications.

VIII. Why This Matters Now

Nvidia restructured its financial reporting around "served tokens." Cerebras's IPO shows the market understands that inference is the bottleneck. Anthropic's scramble for capacity shows the problem is real. Agentic and physical AI will amplify demand by orders of magnitude, across both cloud and edge.

And it also closes the loop on the "$600 Billion Problem" from the other side. Cahn's bearish logic—overbuilding, then oversupply—will likely be validated. But oversupply is precisely the optimal condition for a capital-light aggregator: when GPU prices fall and supply fragments across dozens of clouds, the player that owns no hardware and routes every workload to the cheapest available card profits from the spread, while operators holding depreciating GPUs bear the loss. Hyperbolic is long oversupply, not short it.
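The asymmetry between an owner and an asset-light router can be illustrated with a toy model. Every number here is hypothetical: the owner's fixed hourly cost, the aggregator's take rate, and the three market prices are chosen only to show how each side's per-hour margin behaves as prices fall.

```python
# All figures are HYPOTHETICAL, in USD per GPU-hour.
OWNER_COST_PER_HOUR = 2.00  # assumed amortized hardware, power, and hosting cost
TAKE_RATE = 0.10            # assumed aggregator cut of each routed GPU-hour


def owner_margin(market_price: float) -> float:
    """Margin for an operator that owns the GPU and carries its fixed cost."""
    return market_price - OWNER_COST_PER_HOUR


def aggregator_margin(market_price: float) -> float:
    """Margin for an asset-light router that takes a cut and holds no hardware."""
    return market_price * TAKE_RATE


for price in (3.00, 2.00, 1.20):
    print(f"price ${price:.2f}: owner {owner_margin(price):+.2f}, "
          f"aggregator {aggregator_margin(price):+.2f}")
```

In this sketch the owner's margin turns negative once the market price drops below its fixed cost, while the router's margin shrinks but stays positive. The model ignores volume, so it shows only who carries the depreciation risk, not how much either side ultimately earns.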

The ultimate winning company won't be the one with the most GPUs, but the one that can tell you which GPUs are available where, at what price, and route every workload to the place where it can run at the lowest cost.

Hyperbolic is building that company: it owns no GPUs, is purely software, spans three layers, and is building the aggregation layer for inference compute.

Related Questions

Q: What is the "$600 billion problem" mentioned in the article, and how does the answer relate to AI inference?

A: The "$600 billion problem" refers to a gap identified by Sequoia's David Cahn, where the massive capital expenditure (CapEx) on AI infrastructure, primarily GPUs, far exceeds the revenue currently generated by AI services. The article argues that this gap is being filled not on the training side but on the inference side, as inference is the persistent, recurring operational cost that scales with usage and is now becoming the true bottleneck and revenue driver for AI.

Q: Why does the article argue that inference, not training, is the real bottleneck in the current AI ecosystem?

A: The article argues inference is the bottleneck because training is a one-time cost to build a model, while inference is the recurring, operational cost to run it. As AI moves into agentic and physical applications where AI models continuously perform tasks, inference demand scales parabolically with usage, not just users. This is evidenced by events like Anthropic's capacity struggles and Nvidia's new financial reporting structure focused on "serving tokens."

Q: How does Hyperbolic's business model and position in the AI compute stack differ from other companies in the space?

A: Hyperbolic is unique because it operates across three layers of the compute stack (GPU rental/aggregation, deployment, and model API layer) without owning any physical GPUs. Its core business is aggregating fragmented GPU supply from multiple cloud providers into a standardized pool, providing developers with real-time pricing and availability data, and intelligently routing workloads to the most cost-effective available resources. This makes it an asset-light aggregator and optimizer, contrasting with companies that own hardware or operate at a single layer.

Q: According to the article, what strategic advantage does Hyperbolic gain from being an asset-light aggregator, especially in a potential market of GPU oversupply?

A: Hyperbolic's strategic advantage lies in its position as an asset-light aggregator. In a scenario of potential GPU oversupply and price decline, Hyperbolic benefits by routing workloads to the cheapest available resources without bearing the cost of owning and depreciating hardware. It profits from the price differential and market inefficiencies, effectively "going long on oversupply" while hardware owners shoulder the losses. Its multi-cloud aggregation provides deep liquidity and pricing data, strengthening its routing intelligence.

Q: What is Venice's role in the inference economy, and how does its economic model illustrate the dynamics described in the article?

A: Venice is an application-layer player in the inference economy, offering a privacy-focused AI chat interface that routes user requests to various models (both open-source and proprietary). Its core product is privacy wrapping for commodity inference compute. Its margins are thin, as its primary cost is the inference compute it purchases from upstream providers. This illustrates the article's point that value in the inference stack flows not just to end-user applications but to the layers that aggregate, route, and ultimately capture the cost of revenue that applications like Venice pay for compute.
