Deep Insight: Decentralized Inference is Not Hype, but a Key Track for AI to Break Through Centralized Monopoly

Foresight News | Published on 2026-06-23 | Last updated on 2026-06-23

Abstract

Decentralized Inference: Beyond the Hype, a Key to Breaking AI's Centralized Monopoly

A future scenario where a powerful AI model is banned by a major government illustrates the core value proposition of decentralized AI: resistance to censorship. The core bet of decentralized inference networks is mitigating this risk, with other benefits like cost being secondary. The path is extremely difficult, involving four key challenges:

1. **Running Massive Models:** Distributing a single model across a decentralized GPU swarm requires sophisticated techniques like pipeline parallelism and speculative decoding to overcome crippling network latency, aiming for usable speeds (e.g., 30-40 tokens/second).
2. **Proving Model Integrity:** Verifying that a node runs the correct model is critical. Solutions range from cryptographically secure but slow ZKML to faster, economically-secure methods like statistical fingerprints, deterministic re-execution, or live-weight proofs, each involving trade-offs between integrity, latency, and cost.
3. **Ensuring Prompt Privacy:** Simply sharding a model does not protect user inputs from nodes. Robust solutions currently require trusted hardware (TEEs) or advanced cryptography (FHE), which are not yet widely deployed in consumer swarms.
4. **Building a Real Market:** Identifying the ideal customer is tough. Beyond speculative AI agents, the viable market currently consists of startups embedding AI and projects needing batch processing (e.g., synthetic data ge...


Written by: @KSimback

Compiled by: AididiaoJP


Scenario: What Happens if a Frontier Model Gets Banned?


It's October 2026, just four months from now. GLM-6 has just been released, surpassing Fable-5.1 (a neutered re-release of a banned model) on mainstream benchmarks and performing on par with Mythos. Unable to shut it down directly, the U.S. government issues a series of bans: prohibiting any provider from offering the GLM-6 model, updates, inference services, managing deployments, or technical support within the United States or to U.S. persons.


Amazon Bedrock, Google Vertex, and Microsoft Azure quickly announce compliance, refusing to host the model for enterprise clients. Major aggregation platforms like OpenRouter, Vercel, Cloudflare, and TogetherAI also agree not to list it. GitHub scrubs all related traces from its platform. Hugging Face, the last holdout, eventually removes all downloads for GLM-6-related models.


This scenario, while not the ideal outcome we hope for, is a perfectly plausible conclusion in a world where AI models advance at an exponential rate while policy-making crawls at a snail's pace.


This outcome, or the alternative where frontier AI remains monopolized by a handful of centralized entities, is precisely the fundamental reason why decentralized AI is so crucial.


This article is a companion piece to the author's previous introductory guide "Proof of Useful Work," adopting the same pragmatic approach, focusing on another key corner of crypto-AI (with some overlap between the two). The author delves into the problems decentralized AI must solve, the projects being tracked, due diligence frameworks, and personal judgments after in-depth research.


Why is Decentralized Inference Imperative?


If you followed the scenario above, you have likely already thought of decentralized inference. If not, let's continue the thought experiment.


Once the GLM-6 model weights are released, copies will instantly proliferate across the internet—no ban or remedy can eliminate the tens of thousands of copies that now exist. These copies will be served on decentralized inference networks because there is no central authority there to act against them, and no single node whose ban would cripple the entire network.


Let me be clear: I'm not arguing whether this is good or bad. If a new open-weight model is released that could cause significant harm through misuse, I would never suggest sitting idly by. My point is: models will inevitably be obtained by those who wish to evade censorship.


This is the core premise of decentralized inference—it is a hedge against censorship, whether from governments or frontier labs. Other selling points, like cheaper tokens, verifiable inference, or privacy, are secondary. There's only one core bet: mitigating censorship risk.


Decentralized Inference is Truly Difficult, with Four Major Challenges


For most startups, solving one or two difficult problems is a significant challenge. Decentralized inference projects must simultaneously tackle four genuinely thorny issues. How each project addresses these is the key to separating substance from fluff, alpha from noise.


Challenge One: Running Models That Don't Fit on a Single Machine


The core idea is to build a GPU swarm that uses pipeline parallelism to serve the models users actually want. Simply put, each node holds only a small slice of the model weights plus its own portion of the KV-cache, and each slice is small enough to fit on a consumer-grade RTX 3090/4090 (or on a higher-spec H100 where one is available). Combine enough nodes, and you can host large models like GLM.
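As a rough illustration of how such a swarm might be laid out, the sketch below greedily assigns contiguous layer ranges to nodes according to how many layers fit in each node's memory. The `Node` type, the `assign_layers` helper, and all sizes are hypothetical and chosen only for the example; real schedulers also weigh bandwidth, latency, and node reliability.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    vram_gb: float  # memory usable for weights plus KV-cache (assumed figure)


def assign_layers(nodes, num_layers, gb_per_layer, kv_gb_per_layer):
    """Greedily assign contiguous layer ranges to nodes, in pipeline order.

    Returns a list of (node_name, first_layer, last_layer_exclusive).
    Raises ValueError if the swarm cannot hold the whole model.
    """
    per_layer = gb_per_layer + kv_gb_per_layer
    plan, next_layer = [], 0
    for node in nodes:
        if next_layer >= num_layers:
            break
        capacity = int(node.vram_gb // per_layer)  # whole layers that fit
        if capacity == 0:
            continue  # node too small for even one layer
        end = min(next_layer + capacity, num_layers)
        plan.append((node.name, next_layer, end))
        next_layer = end
    if next_layer < num_layers:
        raise ValueError(f"swarm too small: placed {next_layer}/{num_layers} layers")
    return plan


# Hypothetical swarm and model sizes, for illustration only.
swarm = [Node("rtx3090-a", 24), Node("rtx4090-b", 24), Node("h100-c", 80)]
plan = assign_layers(swarm, num_layers=60, gb_per_layer=1.5, kv_gb_per_layer=0.5)
for name, start, end in plan:
    print(f"{name}: layers [{start}, {end})")
```

Each node only ever sees activations entering and leaving its own layer range, which is what makes the network hops between ranges the next problem to solve.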


Petals proved the feasibility of this approach as early as 2022 with a BitTorrent-style swarm running BLOOM-176B on consumer GPUs, but the speed was only about 1 token per second. Clearly, that speed was unusable, so subsequent innovation focused on making models run faster.


The truly fatal bottleneck is the network. Within a data center, GPUs communicate via NVLink at TB/s speeds; over the public internet, round-trip latency (RTT) can be tens of milliseconds. The decoding process is sequential, so a naive swarm pays the network round-trip cost for every token generated.
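A back-of-the-envelope model makes the cost concrete. The sketch below assumes a ring-shaped pipeline in which every generated token passes through every stage once, paying one network hop per stage; the stage count, hop latency, and compute time are made-up numbers, not measurements from any project.

```python
def naive_tokens_per_second(num_stages, hop_latency_s, stage_compute_s):
    """Decode rate when each token crosses every pipeline stage in sequence.

    Assumes one network hop per stage (the last stage hands the sampled
    token back to the first), with no batching or overlap.
    """
    per_token_s = num_stages * (stage_compute_s + hop_latency_s)
    return 1.0 / per_token_s


# Hypothetical swarm: 8 stages, 40 ms per hop, 5 ms of compute per stage.
internet = naive_tokens_per_second(8, 0.040, 0.005)
# Same pipeline with near-zero (10 microsecond) hops, as inside one rack.
datacenter = naive_tokens_per_second(8, 0.00001, 0.005)
print(f"internet: {internet:.1f} tok/s, datacenter: {datacenter:.1f} tok/s")
```

Under these assumptions the compute is identical in both cases; the gap comes entirely from the hops, which is why later work targets the number of network traversals per token rather than raw GPU speed.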


The most common solution is speculative decoding: a small, cheap draft model proposes K candidate tokens first, and the large sharded model verifies these K tokens in a single pipeline pass, then keeps the longest matching prefix. This way, one expensive network traversal yields several tokens, not just one.
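The mechanism can be sketched with toy stand-ins for both models. In the example below, `target_next` and `draft_next` are hypothetical deterministic functions over integer tokens (not real models), and verification uses the simple greedy rule: accept draft tokens while they match what the target would have produced, then substitute the target's own token at the first mismatch.

```python
def target_next(context):
    """Stand-in for the large sharded model: a deterministic toy rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Stand-in for the small draft model: wrong at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def speculative_step(context, k):
    """Draft k tokens, then verify them in one simulated target pass."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # In a real swarm, these target predictions come from ONE pipeline traversal.
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all k matched: one bonus token
    return accepted


def generate(prompt, num_tokens, k):
    """Return (generated tokens, number of target passes used)."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(out, k))
        passes += 1
    return out[len(prompt):len(prompt) + num_tokens], passes


def greedy(prompt, num_tokens):
    """Baseline: one target pass per token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


tokens, passes = generate([1, 2, 3], num_tokens=20, k=4)
print(tokens == greedy([1, 2, 3], 20), passes)
```

The output is identical to plain greedy decoding with the target; only the number of expensive passes drops. How much it drops depends on how often the draft agrees with the target, which in practice varies by model pair and workload.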


Currently, swarms running over real-world internet links reach about 30-40 tokens per second. That is significant progress, but it has not yet been validated at scale or at the speeds users truly need. This is a problem that demands serious engineering.


Note: Serving Inference is More Than Just Raw FLOPs


A common trap when comparing any swarm method to cloud-hosted models is focusing only on tokens per second, assuming that's the whole story.


But production-grade inference must get many things right that have nothing to do with raw compute power:


  • Balancing Time to First Token (TTFT) and inter-token latency
  • Prefill vs. decode phases (with completely opposite hardware needs)
  • Placement and transfer of the KV-cache
  • Streaming, continuous batching, and utilization under mixed loads
  • Long-context behavior, cold starts, and model warm-up
  • Node churn


Due Diligence Point: When a project cites throughput numbers, always ask what they're competing against. Centralized deployments using vLLM or SGLang (with disaggregated prefill and continuous batching) are the real benchmark, and this benchmark gets faster every quarter. "We achieved 30 tokens per second over the internet" sounds impressive, but may still lack competitiveness.


Challenge Two: Proving You Actually Got the Model You Paid For


If you don't trust the node, how do you know it actually ran the claimed model and didn't secretly swap it for a cheaper quantized version? Especially in networks involving mining tokens, it's easy for providers to "play games," ostensibly serving the actual model while running something cheaper.


Currently, there are five mainstream approaches:


  • ZKML: Zero-knowledge proofs for forward passes. Cryptographically bulletproof, but overhead is ~10,000x native. Generating one token for a Llama-3 model takes about 150 seconds. Impossible at frontier scale in the short term.
  • opML: Outputs are posted with a bond and a challenge window opens; if challenged, a fraud proof bisects the dispute down to a single step, which an arbiter re-runs. Near-native speed, but finality has to wait out the window, and there's a "verifier's dilemma" (if verification costs more than the value of catching cheating, no one verifies).
  • Deterministic re-execution: Make inference byte-for-byte reproducible; disputes only need to check if bytes are equal. Overhead less than 2%, backed by restaked ETH.
  • Statistical fingerprints: Cheaply hash or sample computations, catching most cheating most of the time. Not absolutely correct, but fast and suitable for heterogeneous GPUs, which a permissionless swarm needs.
  • Live-weight proofs: Directly sample the tensors actually residing in the service runtime, comparing them against a manifest of the approved model. Verifies "what was loaded," not "what was output." Overhead is only about 0.1%. This is a truly different approach.
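To make the last approach more concrete, here is a minimal sketch of the general idea of spot-checking loaded weights against a manifest. It is an assumption-laden toy, not any project's actual protocol: the chunk size, the manifest format (per-chunk SHA-256 digests), and the `read_loaded_chunk` hook are all hypothetical, and weights are modeled as a flat byte string.

```python
import hashlib
import random

CHUNK = 64  # bytes per chunk; tiny for the demo


def build_manifest(weights: bytes):
    """Per-chunk SHA-256 digests of the approved model's weights."""
    return [hashlib.sha256(weights[i:i + CHUNK]).hexdigest()
            for i in range(0, len(weights), CHUNK)]


def read_loaded_chunk(loaded: bytes, index: int) -> bytes:
    """Hypothetical node-side hook: return chunk `index` of the weights in memory."""
    return loaded[index * CHUNK:(index + 1) * CHUNK]


def spot_check(manifest, loaded, samples, rng):
    """Verifier picks random chunks and compares their digests to the manifest."""
    for index in rng.sample(range(len(manifest)), samples):
        digest = hashlib.sha256(read_loaded_chunk(loaded, index)).hexdigest()
        if digest != manifest[index]:
            return False
    return True


rng = random.Random(0)
approved = bytes(rng.getrandbits(8) for _ in range(CHUNK * 200))
manifest = build_manifest(approved)
swapped = bytes(b ^ 0xFF for b in approved)  # stand-in for a different model in memory

print(spot_check(manifest, approved, 8, rng), spot_check(manifest, swapped, 8, rng))
```

The hard part, which this sketch does not address, is guaranteeing that the sampled bytes really come from the tensors used to serve requests rather than from a clean copy kept on the side. Sampling also only catches a partial swap with some probability that grows with the number of chunks checked.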


The real-world trade-off: of cryptographic integrity, low latency, and cost efficiency, you can have only two at once. ZKML gets integrity but sacrifices latency and cost; the other methods get latency and cost but can only offer economic or statistical integrity.
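What "statistical integrity" means in practice can be shown with a small example. The sketch below compares log-probabilities reported by a node against reference values from a trusted run, accepting small numeric drift but rejecting large shifts. The probe set, the tolerance of 0.05, and the simulated values are all assumptions for illustration; it is not the mechanism of any named project.

```python
import random


def fingerprint_matches(reference, reported, tolerance=0.05):
    """Accept if the mean absolute log-probability gap stays within tolerance."""
    if not reference or len(reference) != len(reported):
        return False
    gap = sum(abs(r - p) for r, p in zip(reference, reported)) / len(reference)
    return gap <= tolerance


rng = random.Random(7)
# Hypothetical logprobs for 64 probe tokens from a trusted run of the approved model.
reference = [-rng.uniform(0.1, 3.0) for _ in range(64)]
# Honest node on different hardware: same model, tiny numeric drift.
honest = [lp + rng.uniform(-0.01, 0.01) for lp in reference]
# Node that swapped in a cheaper model: logprobs shift noticeably.
swapped = [lp + rng.choice([-0.3, 0.3]) for lp in reference]

print(fingerprint_matches(reference, honest), fingerprint_matches(reference, swapped))
```

The tolerance is the whole trade-off in miniature: set it loose enough to tolerate heterogeneous GPUs and a carefully chosen cheaper model may slip under it; set it tight and honest nodes get flagged.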


Due Diligence Point: Ask which method a project uses, why, and what this trade-off means for the end product.


Challenge Three: How to Truly Keep Prompts Private?


Proving output correctness is a completely different problem from hiding input. In a sharded swarm, each node must decrypt activations to compute—encryption only protects the transmission line, not the node itself.


Transformer activations are actually very easy to reverse-engineer. A CCS 2025 paper showed over 90% accuracy in reconstructing input prompts from intermediate activations. The "Hidden No More" paper from ICML 2025 achieved near-perfect recovery and defeated the noise-and-permutation defense commonly used in swarms.


The only robust fix currently is a heavier sequence-sharded scheme, which no one in the consumer-GPU camp has truly launched yet, so this remains a largely unsolved problem.


A swarm can claim "no node holds the entire model," yet still leak every prompt to any node along the path. "No node holds the model" was never a privacy property.
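The simplest version of this leak needs no machine learning at all. In the toy below, a client sends token embeddings to the first shard, and the node recovers the prompt by nearest-neighbor lookup in the embedding table, which is public for any open-weight model. The vocabulary, dimension, and random embeddings are invented for the example; the attacks cited above go further and invert activations from deeper layers.

```python
import random

rng = random.Random(0)
VOCAB = ["the", "wire", "transfer", "password", "is", "tulip", "42", "send"]
DIM = 16
# Open-weight model: the embedding table is public, so every node has it.
EMBED = {tok: [rng.gauss(0, 1) for _ in range(DIM)] for tok in VOCAB}


def client_encode(prompt_tokens):
    """Client side: turn tokens into the activations sent to the first shard."""
    return [EMBED[tok] for tok in prompt_tokens]


def node_recover(activations):
    """Curious node: map each activation to its nearest embedding row."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(VOCAB, key=lambda tok: sq_dist(EMBED[tok], act))
            for act in activations]


prompt = ["send", "the", "wire", "transfer", "password", "is", "tulip", "42"]
leaked = node_recover(client_encode(prompt))
print(leaked == prompt)
```

Encrypting the link between client and node changes nothing here, because the node has to decrypt the activations before it can compute on them.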


What can genuinely provide privacy is hardware or mathematics, not network topology. TEEs (Trusted Execution Environments)—like Phala's solution on GPUs, Darkbloom's on Apple silicon, Venice's Pro mode—shift trust to a hardware root and provide attestation.


Fully Homomorphic Encryption (FHE) directly computes on ciphertexts, trusting nothing, but the cost for large models is currently unacceptable.


Due Diligence Point: A project either genuinely has one of these schemes, or it doesn't have privacy, no matter how the landing page is worded.


Important Reminder: "Private" does not equal "trustless." TEEs don't eliminate trust; they just shift it from node operators to hardware vendors, the firmware chain, attestation services, and the enclave implementation.


The real question is: Whose root of trust are you willing to accept? The chip maker? A set of restaked validators? A TEE network? Or pure mathematics?


Challenge Four: How to Build a Real Two-Sided Market?


The first three are technical challenges; the fourth is a business challenge.


For a decentralized inference network serving open-weight models, who is the ideal customer profile (ICP)?


Most ordinary consumers currently get tremendous value from subscription plans—lots of intelligence for $20-200 per month. These subsidized plans may disappear or become limited in the future, but it's very difficult to win over consumers today with pay-per-use inference APIs.


Enterprises won't be big buyers in the short term either. This may change in the future, but don't count on it soon.


That leaves two real user categories: 1) Startups and businesses embedding inference into their own product stacks, who naturally need API plans; and 2) Autonomous AI agents seeking their own inference capabilities.


The startup category is a growing market, a niche where significant revenue might be captured, but there's a clear near-term cap on value capture. AI agents as buyers are more speculative—someone still needs to pay for them in the short term.


Here's the dilemma: How do you aggregate meaningful supply of the models people actually want, when the target user group is unlikely to be big spenders on the network?


The only viable source of supply today is decentralized GPU providers. Projects like io.net, Akash, Render, Aethir, and Nosana have been doing this for years, renting out entire GPUs, or nodes with enough capacity to hold an entire model, to paying customers via token-coordinated markets. There is precedent.


Due Diligence Point: Ask about the project's ICP and how they plan to acquire target users while also keeping the supply side satisfied. If everything is built on expectations of speculative token appreciation, that's a clear warning sign.


Who's Really Solving These Challenges? A Rundown of Major Projects


There are many projects currently categorized under "decentralized inference," but most don't address all four challenges equally; they have different focuses.


Petals: The absolute pioneer in decentralized inference. In 2022, proved BLOOM-176B could run BitTorrent-style on consumer GPUs. Conceptually significant but didn't solve incentives, privacy, or monetization. Any project that's essentially "Petals architecture + token" is likely larping.


Dolphin Network: The team behind the Dolphin series of uncensored open models (over 5 million downloads on Hugging Face). It started from real user demand and built the network afterwards. Technical highlights include live-weight proofs (0.1% overhead), layered with logprob fingerprints, software integrity checks, and account-level bonding. Has generated over 3.2 billion tokens, with sustained throughput of ~9,400 tokens/s. A product-first, execution-strong representative.


Inference.net (formerly Kuzco): One of the most mature attempts at verifying models in the wild. Unique LOGIC mechanism uses logprob statistical tests to catch model swaps. Has been in production for ~18 months, fleet size in the thousands of GPUs. One of the few projects with both verification primitives and real operational history.


Morpheus: A decentralized routing and rewards layer, providing an OpenAI-compatible API + smart agent wrappers. Technical highlight is TEE-backed provider verification (Intel TDX + NVIDIA GPU attestation live). Worth monitoring: MOR emissions and evidence of real external demand.


Chutes (Bittensor subnet 64): User-side is an OpenAI-compatible API, backend is Docker-packaged chutes deployed to Bittensor GPU miners. Has clear advantages in distribution and scale, but still lags in verification and privacy.


c0mpute: A new Solana-native project. Its Shard engine splits frontier models across consumer GPUs. Has public demos for GLM-5.2 744B and gpt-oss-120B (30-40 t/s). Technical artifacts are verifiable, but still extremely early (repo went live days ago, founder anonymous, token is pump.fun micro-cap).


Parallax (Gradient Network): A P2P distributed LLM inference framework supporting pipeline-parallel sharding across consumer GPUs and Apple Silicon, enabling individuals or small orgs to run "sovereign clusters." Strong institutional backing (Pantera and Multicoin led a $10M seed round), but privacy scheme unclear.


Darkbloom: Allows users to turn idle Mac compute into a private inference marketplace. Each Mac runs the full model, with privacy guaranteed via Secure Enclave attestation. Doesn't take the sharded swarm route; attestation stack is rigorous. Moved from research preview to public alpha; real traction worth watching (decentralization doesn't necessarily require tokenization).


MeshLLM: A permissionless P2P inference mesh introduced by Jack Dorsey and built by a team associated with Block. Uses Nostr for node discovery, no central servers, closer to BitTorrent than Bittensor. Protocol-first, no token, censorship-resistant.


Venice and Its Reseller Ecosystem: The exemplar for the entire field in searching for PMF and a viable business model. It is itself a centralized but privacy-tiered consumer proxy, having effectively solved some challenges. A sub-ecosystem of resellers like UsePod, AntSeed, Surplus Intelligence has formed around it, primarily doing demand aggregation and settlement, not directly providing decentralized compute.


The Battleground for Decentralized Inference


A cost advantage exists only once you separate latency-sensitive workloads from throughput-oriented ones. They are two different products; decentralization is a tax on one and a feature for the other.


Scenarios where Centralized Clearly Wins (decentralization is a tax): ChatGPT-style interactive chat, real-time coding agents, low-latency voice, high-frequency tool calls, enterprise-grade strict p95 latency SLAs, competitively low-latency service for dense frontier models.


Scenarios where Decentralized Might Win (supply aggregation advantage): Synthetic data generation, offline evaluation, batch embeddings, batch RAG, long-term agent research tasks, image/video generation queues, non-urgent open-model inference (where marginal cost of idle hardware approaches zero).


Simple Framework: When latency matters, decentralization is a tax; when throughput matters, decentralization can be a supply aggregation advantage.


The Hidden Long-Term Value: Data Loop


Decentralized inference networks can also collect vast amounts of valuable data—synthetic training data, preference data, agent traces, evaluation outputs, fine-tuning data, RL environments, tool usage traces, etc. This data can feed back into decentralized training networks (like Nous Psyche, Prime Intellect, Gensyn-style projects), producing updated open-weight models that flow back into inference networks.


In the long run, this isn't a separate bet on "decentralized training" or "decentralized inference," but a closed loop: Inference generates traces → Traces become training data → Training updates models → Updated models flow back to inference.


The best projects will treat this loop as a core strategy, and training and inference projects will further merge in the future.


Practical Due Diligence Checklist: Just Answer These Seven Questions


  • Is it truly decentralized? Specifically at which layers? (Many just slap the label on because they have a token)
  • Can you trust the output came from the model you paid for? (Deterministic, proofs, fingerprints, or nothing?)
  • After deducting token and coordination overhead, is it genuinely cheaper than centralized? (In production, not in theory)
  • Are prompts truly hidden from operators? (Only TEE/FHE count; mere sharding does not)
  • Can the system run reliably when nodes are unreliable and scattered across the internet?
  • Is anyone actually paying, and for something they can't get cheaper centrally?
  • Does the team possess genuine AI technical capability? (The most important one)


Extra Tip: Be wary of "elegant technical solutions" without credible distribution plans.


My Final Judgment


I'm generally bearish on categories that only appeal to crypto-natives (TAM seems limited in my view). I prefer to see projects that also appeal to non-crypto users, hiding the crypto mechanisms in the background.


Decentralized inference is one of the few tracks in crypto with genuine breakout potential: everyone needs inference, and it can be served the same way traditional providers serve it, even seamlessly through platforms like OpenRouter. The key is cost, performance, and privacy.


Advice: Support projects that can clearly articulate which layer they've decentralized and who their buyer is. Avoid projects that just use "decentralized AI" as a slogan, followed by a coin.


Disclosure: The original author holds tokens in some projects mentioned. They were not influenced or compensated by any project, and judgments are personal opinions.
