Author: Frank Fu, IOSG
The gap David Cahn identified in 2023 was never filled on the training side. It was filled on the inference side, and the market has only started pricing that in over the past few weeks. Now that Nvidia has restructured its financial reporting around "served tokens" and Cerebras's IPO has been 20 times oversubscribed, the debate over the bottleneck is over. The real question is the next one: when inference becomes the scarce resource, where in the compute stack will the value accrue?
I. Following the GPU: From a $200 Billion Problem to a $600 Billion Problem
In 2023, Sequoia's David Cahn raised the question hanging over the entire AI buildout: the "$200 Billion Problem." For every $1 spent on a GPU, roughly another $1 is needed to power it in a data center, and the end customers of that compute need to earn a margin of their own. Taken together, each year's GPU CapEx implies that these chips must ultimately generate about $200 billion in revenue to recoup the capital. Even under very generous assumptions for AI revenue, he found a gap exceeding $125 billion between "investment" and "what end customers actually pay." The concern was straightforward: GPUs were being built out ahead of real demand.
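The arithmetic behind the $200 billion figure can be sketched in a few lines. This is a minimal illustration, not Cahn's own model: the $50 billion GPU run-rate and the 50% end-customer margin are illustrative inputs that this article does not state, chosen because they reproduce the headline number.

```python
def implied_revenue_b(gpu_run_rate_b: float,
                      datacenter_multiplier: float = 2.0,
                      end_user_margin: float = 0.5) -> float:
    """Revenue (in $B) end customers must earn to pay back one year of GPU CapEx."""
    # Every $1 of GPU spend implies roughly another $1 of data center cost.
    datacenter_spend_b = gpu_run_rate_b * datacenter_multiplier
    # Buyers of that compute need a margin too, so revenue must exceed cost.
    return datacenter_spend_b / (1.0 - end_user_margin)


# Illustrative input (assumption): $50B of annual GPU spend.
required_b = implied_revenue_b(50.0)

# The text cites a gap exceeding $125B, which caps the AI revenue
# that was being counted against the requirement.
gap_b = 125.0
max_counted_revenue_b = required_b - gap_b

print(f"Required revenue: ${required_b:.0f}B")
print(f"Revenue counted against it: at most ${max_counted_revenue_b:.0f}B")
```

The point of the sketch is the structure: the required revenue scales linearly with GPU spend, so any acceleration in CapEx widens the requirement by the same multiple.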
A year later, the gap hasn't narrowed; it has widened. In Cahn's 2024 follow-up, as hyperscaler CapEx ballooned, he redefined it as the "$600 Billion Problem." The bearish logic converged into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.
Both articles essentially ask the same thing: who will fill this gap? The answer never appeared on the "training" side of the ledger. It appeared on the "inference" side, and the market has only started pricing it in over these past few weeks.
II. The Cerebras IPO and the Inference Squeeze
Cerebras went public on Thursday. The IPO was 20 times oversubscribed, and the stock traded at nearly twice the price set in Wednesday's final upward revision. The demand did not come from a bet on the "next Nvidia killer." It came from something simpler: the market has begun to realize that the real bottleneck in AI is inference, not training.
Cerebras's core competency is a chip architecture that makes inference extremely fast. Not training, but inference. This is precisely what excites Wall Street. The inference market is recurring, expanding with usage. Every time Claude answers a question, every time an agent performs a task, compute is consumed. Training happens once; inference never stops.
J.P. Morgan estimates the inference market to be 10 to 50 times the size of training. When machines start executing tasks assigned by other machines, i.e., agentic expansion, inference demand no longer scales with the number of users, but with compute itself.
III. Nvidia Redraws the Map: Inference Makes Headlines
If Cerebras represents the market's awakening, Nvidia's latest earnings report is confirmation from the top of the supply chain. On the earnings call, Jensen Huang made the unspoken explicit: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has moved from one-shot inference, through step-by-step reasoning, into an agent stage in which models call tools and orchestrate tasks themselves. Huang said, "Tokens are now profitable." In the AI era, compute equals revenue and profit.
This reshapes the entire industry. Training is a one-time cost to build a model; inference is the recurring cost to run it, and today's bottleneck is in inference, not training.
Nvidia wrote this assessment into its financial reporting format. It now reports by two platforms, not one: Data Center and Edge Computing. Data Center (about $75 billion for the quarter, +92% YoY) is further split into Hyperscale (about $38 billion, +12% QoQ) and ACIE, i.e., AI Cloud, Industrial & Enterprise (about $37 billion, +31% QoQ). The entirely new line is Edge Computing: $6.4 billion, +29% YoY, covering the endpoints where agentic AI and physical AI actually run, such as PCs, workstations, AI-RAN base stations, robots, and cars.
Edge still accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside Data Center. The signal is that inference is splitting into two fronts: cloud inference in data centers, and endpoint inference at the edge, where AI sees, moves, and acts in the physical world. The roadmap follows the same logic. Vera Rubin, which starts shipping in Q3, offers up to 35 times the inference throughput of Blackwell, and Huang also gave a new $200 billion TAM for the Vera CPU, which is designed for agentic workloads. Every frontier model company is expected to fully transition to it on day one.
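The segment figures quoted above can be sanity-checked with simple arithmetic. The sketch below uses only the approximate numbers given in the text and assumes that total revenue is the sum of the two reported platforms, which the article implies but does not state outright.

```python
# Quarterly figures as quoted in the text, in billions of USD (approximate).
data_center_b = 75.0
hyperscale_b = 38.0
acie_b = 37.0
edge_b = 6.4

# Assumption: total revenue = Data Center + Edge Computing.
total_b = data_center_b + edge_b

# The two Data Center sub-segments should add up to the platform total.
subsegments_b = hyperscale_b + acie_b

# Edge as a fraction of total revenue.
edge_share = edge_b / total_b

print(f"Hyperscale + ACIE: ${subsegments_b:.1f}B (Data Center: ${data_center_b:.1f}B)")
print(f"Edge share of total revenue: {edge_share:.1%}")
```

Under that assumption the edge share comes out just under 8%, consistent with the claim in the text.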
When the world's most valuable company restructures its financial disclosures around "served tokens," the debate over the bottleneck is settled. The remainder of this article discusses who captures the value when inference (not training) becomes the scarce resource.
First, a scope note. Of these two fronts, this article discusses cloud inference, i.e., API token services provided from rented data center GPUs. Endpoint inference runs on local chips inside the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN), completely bypassing the underlying GPU rental and aggregation stack. For our purposes, consider it a tailwind that amplifies the overall inference economy and supports the bottleneck thesis. It is not the market where Hyperbolic and Venice operate; both sit entirely on the cloud front.
IV. The Squeeze is Here
Anthropic is the canary in the coal mine. Usage far exceeded provisioned capacity, and complaints that Claude had been "lobotomized" flooded the internet, citing rate-limited replies, slower inference, and compressed context windows. The solution was raw compute: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with 220k+ Nvidia GPUs and 300+ megawatts, and dedicated it to inference, not training.
This capacity unlocked a series of quota changes, each a signal. On May 6, Anthropic doubled Claude Code's five-hour limit, removed peak-hour throttling, and significantly increased Opus API rate limits. On May 13, it raised Claude Code's weekly limit by another 50% (through July 13). Then, starting June 15, it did the opposite of being "generous": it carved out agentic and programmatic usage (Agent SDK, headless mode claude -p, CI pipelines) from flat subscriptions into a separate metered credit pool ($20 to $200 per month, priced at API rates). This final step condensed the entire thesis into one action: agents consume inference faster than flat subscriptions were designed to handle, so they must be priced at their true "recurring cost."
Training is a one-time capital expenditure. Inference is a recurring operating cost, compounding with every new user, every new agent.
V. The Stack: Six Layers, One Bottleneck
Every AI application sits on a supply chain that starts at the TSMC fab and ends at the API endpoint.
Most companies own only one layer. Nvidia owns silicon, CoreWeave owns bare metal, Together AI owns inference optimization, OpenRouter owns model API routing.
Only one company is the exception.
VI. Hyperbolic: The Only Company Spanning Three Layers
Hyperbolic launched its on-demand GPU marketplace in June 2025. Within its first few months, it passed 200k developers, with adopters spanning frontier AI labs, search, and major consumer platforms.
What's interesting is its architecture.
Hyperbolic doesn't own a single GPU itself. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This sounds like a weakness but is actually a moat.
By sitting between GPU suppliers and consumers, Hyperbolic sees real-time data others don't. It knows who is buying what GPU, at what price, at what time. It sees oversupply before it becomes public, sees demand surges before they hit the market.
Today, the moat is this multi-cloud aggregation itself. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized, unified pool, allowing developers to rent the cheapest available GPU anywhere without negotiating with each operator or managing a pile of accounts. The more clouds it connects, the deeper the liquidity, the richer the pricing data. Further out, the team is exploring using this data to model GPU price curves and eventually deploying its own capital to smooth supply and demand, acting as a market maker for physical compute; but this goal remains early. What's compounding today is the aggregation layer.
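The core routing idea, renting the cheapest available GPU across many clouds, can be sketched as a simple selection over a unified pool of offers. This is a hypothetical illustration, not Hyperbolic's implementation: the `Offer` type, the provider labels, and all prices are invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    provider: str            # hypothetical provider label
    gpu_model: str
    usd_per_gpu_hour: float  # hypothetical price
    gpus_available: int


def cheapest_offer(offers: Iterable[Offer],
                   gpu_model: str,
                   gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can satisfy the request, or None."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.usd_per_gpu_hour, default=None)


# A standardized pool stitched together from several independent clouds.
pool = [
    Offer("cloud-a", "H100", 2.40, 8),
    Offer("cloud-b", "H100", 1.95, 4),
    Offer("cloud-c", "H100", 2.10, 16),
    Offer("cloud-c", "A100", 1.10, 32),
]

best = cheapest_offer(pool, "H100", gpus_needed=8)
print(best)
```

In this toy pool, the cheapest H100 listing cannot serve an 8-GPU request, so the router falls back to the next-cheapest provider with enough capacity. Every additional cloud in the pool gives the selection more candidates, which is the liquidity effect the text describes.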
This is the flywheel:
- Connect more clouds → More aggregated supply
- More supply → Deeper market & real-time pricing data
- Better data → Smarter routing today, pricing models long-term
- Better liquidity & price → More developers → More clouds want to connect
No other company is attempting this. Hyperbolic is the only company spanning the GPU rental layer, deployment layer, and model API layer simultaneously.
VII. The Mirror of Venice
Venice is the clearest manifestation of the inference economy at the application layer and a useful contrast to Hyperbolic's position. It is a privacy-first inference application: an OpenAI-compatible API, plus consumer-facing subscriptions (Free / Pro / Pro+ / Max), routing requests to about 75 models, roughly two-thirds of which are open-source or self-hosted (Llama, Mistral, Qwen, DeepSeek), with the rest being anonymized passthroughs to closed-source frontier models. The key point is, Venice does not own significant compute itself. It rents from undisclosed GPU partners and confidential computing suppliers (NEAR AI Cloud, Phala) and pays frontier labs for passthroughs, so its true cost of revenue is inference compute, not SaaS hosting.
What Venice actually sells is privacy. The "privatization" here isn't turning public compute into private property, but wrapping commoditized inference with a guarantee: no data retention, no training on data, anonymized requests, and some workloads running in TEEs, which makes the plaintext invisible even to the operators. The underlying compute is a commodity; the markup sells this privacy wrapper. Moreover, this guarantee is layered, not uniform. For open-source models running on GPUs it controls or on TEE GPUs, Venice can achieve near end-to-end confidential computing. For anonymized passthroughs to closed-source models such as Claude and GPT, privacy only means stripping identity, and the frontier lab still processes your raw prompt. So the strongest privacy covers only the open-source portion; the frontier model portion is "anonymous," not "truly confidential." Venice's gross margin = subscription price − inference cost paid to its suppliers, and the premium it can charge over the bare API price leans almost entirely on this privacy markup, which is also why it has thin margins and is constrained by frontier passthrough pricing.
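The margin relationship above can be made concrete with a small sketch. All figures are hypothetical and chosen only to show the mechanism: the more of a subscriber's usage is passed through to closed-source frontier models, the higher the blended inference cost and the thinner the margin.

```python
def blended_inference_cost(passthrough_share: float,
                           open_source_cost: float,
                           passthrough_cost: float) -> float:
    """Monthly inference cost per subscriber for a given usage mix."""
    return ((1.0 - passthrough_share) * open_source_cost
            + passthrough_share * passthrough_cost)


def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross margin = subscription price - inference cost."""
    return subscription_price - inference_cost


# Hypothetical per-subscriber monthly figures in USD (assumptions).
price = 20.0
open_source_cost = 4.0    # serving the same usage entirely on open-source models
passthrough_cost = 16.0   # serving the same usage entirely via frontier passthrough

for share in (0.0, 0.5, 1.0):
    cost = blended_inference_cost(share, open_source_cost, passthrough_cost)
    print(f"passthrough share {share:.0%}: margin ${gross_margin(price, cost):.2f}")
```

Because the passthrough price is set by the frontier labs, the application has no control over the largest term in its cost of revenue.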
The token design packages this inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM. DIEM is an inference credit, with each DIEM roughly equivalent to $1 of compute per day. Paid subscriptions trigger programmatic buyback and burn of VVV (about $2 / $5 / $10 for Pro / Pro+ / Max respectively), while emissions follow a fixed decreasing schedule: 6M → 5M → 4M VVV per month, dropping to 3M on July 1st. The buybacks are real but discretionary and still modest: about $103k burned in each of April and May, slowly climbing toward about $110k in June, still far below the $200k per month line.
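The buyback and emission mechanics can be sketched as follows. The per-tier burn amounts and the emission schedule come from the text; the subscriber counts are hypothetical, and the sketch assumes each per-tier amount applies once per paid subscription per month, which the article does not specify.

```python
# Approximate USD of VVV bought back and burned per paid subscription (from the text).
BURN_PER_SUBSCRIPTION_USD = {"Pro": 2.0, "Pro+": 5.0, "Max": 10.0}

# Monthly VVV emission schedule (from the text).
EMISSIONS_VVV_PER_MONTH = [6_000_000, 5_000_000, 4_000_000, 3_000_000]


def monthly_burn_usd(paid_subscriptions: dict) -> float:
    """USD value of VVV burned in a month for a given mix of paid subscriptions."""
    return sum(BURN_PER_SUBSCRIPTION_USD[tier] * count
               for tier, count in paid_subscriptions.items())


def emission_cut(schedule: list) -> float:
    """Fractional reduction from the first to the last step of the schedule."""
    return (schedule[0] - schedule[-1]) / schedule[0]


# Hypothetical subscriber mix (assumption, for illustration only).
subs = {"Pro": 30_000, "Pro+": 5_000, "Max": 2_000}

print(f"Monthly burn: ${monthly_burn_usd(subs):,.0f}")
print(f"Emission reduction over the schedule: {emission_cut(EMISSIONS_VVV_PER_MONTH):.0%}")
```

The burn side scales with paid subscriptions while the emission side is fixed by schedule, so the net token supply effect depends on subscriber growth and the VVV price, neither of which this sketch models.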
The fundamentals are healthier than the headlines. The widely circulated "$70 million ARR" figure almost certainly mistakes subscription renewals for net new customer acquisition; a more defensible observable range is closer to $6 to $15 million ARR. Underneath, traction is real: about 136k token-holding addresses, about 9.9 million website visits per month (about 330k daily), with new Pro subscriptions hovering around 1,400 per day. This is a real business, but a thin-margin one, whose economics are constrained by the compute it buys.
This is precisely why Hyperbolic sits one layer upstream of it. If Venice is the gas station, Hyperbolic is the refinery. Venice buys compute from the same constrained supply everyone depends on; Hyperbolic aggregates and standardizes that fragmented supply, then sells it to Venice and every player like it. As inference demand grows, value accrues not only to the applications consuming compute, but even more to the layer that aggregates and routes that compute, capturing the cost of revenue those applications pay.
VIII. Why This Matters Now
Nvidia restructured its financial reporting around "served tokens." Cerebras's IPO shows the market understands that inference is the bottleneck. Anthropic's scramble for capacity shows the problem is real. Agentic and physical AI will amplify demand by orders of magnitude, across both cloud and edge.
It also closes the loop on the "$600 Billion Problem" from the other side. Cahn's bearish logic (overbuilding, then oversupply) will likely be validated. But oversupply is precisely the optimal condition for a capital-light aggregator: when GPU prices fall and supply fragments across dozens of clouds, the player that owns no hardware and routes every workload to the cheapest available card profits from the spread, while operators holding depreciating GPUs bear the loss. Hyperbolic is long oversupply, not short it.
The ultimate winning company won't be the one with the most GPUs, but the one that can tell you which GPUs are available where, at what price, and route every workload to the place where it can run at the lowest cost.
Hyperbolic is building that company. It owns no GPUs, is purely software, spans three layers, and is building the aggregation layer for inference compute.







