Written by: @KSimback
Compiled by: AididiaoJP
Scenario: What Happens if a Frontier Model Gets Banned?
It's October 2026, just four months from now. GLM-6 has just been released, surpassing Fable-5.1 (a neutered re-release of a banned model) on mainstream benchmarks and performing on par with Mythos. Unable to shut it down directly, the U.S. government issues a series of bans prohibiting any provider from offering the GLM-6 model, its updates, inference services, managed deployments, or technical support within the United States or to U.S. persons.
Amazon Bedrock, Google Vertex, and Microsoft Azure quickly announce compliance, refusing to host the model for enterprise clients. Major aggregation platforms like OpenRouter, Vercel, Cloudflare, and TogetherAI also agree not to list it. GitHub scrubs all related traces from its platform. Hugging Face, the last holdout, eventually removes all downloads for GLM-6-related models.
This scenario, while not the ideal outcome we hope for, is a perfectly plausible conclusion in a world where AI models advance at an exponential rate while policy-making crawls at a snail's pace.
This outcome, or the alternative where frontier AI remains monopolized by a handful of centralized entities, is precisely the fundamental reason why decentralized AI is so crucial.
This article is a companion piece to the author's previous introductory guide "Proof of Useful Work," adopting the same pragmatic approach, focusing on another key corner of crypto-AI (with some overlap between the two). The author delves into the problems decentralized AI must solve, the projects being tracked, due diligence frameworks, and personal judgments after in-depth research.
Why is Decentralized Inference Imperative?
Following the scenario above, you have likely already thought of decentralized inference. If not, let's continue the thought experiment.
Once the GLM-6 model weights are released, copies will instantly proliferate across the internet—no ban or remedy can eliminate the tens of thousands of copies that now exist. These copies will be served on decentralized inference networks because there is no central authority there to act against them, and no single node whose ban would cripple the entire network.
Let me be clear: I'm not arguing whether this is good or bad. If a new open-weight model is released that could cause significant harm through misuse, I would never suggest sitting idly by. My point is: models will inevitably be obtained by those who wish to evade censorship.
This is the core premise of decentralized inference—it is a hedge against censorship, whether from governments or frontier labs. Other selling points, like cheaper tokens, verifiable inference, or privacy, are secondary. There's only one core bet: mitigating censorship risk.
Decentralized Inference is Truly Difficult, with Four Major Challenges
For most startups, solving one or two difficult problems is a significant challenge. Decentralized inference projects must simultaneously tackle four genuinely thorny issues. How each project addresses these is the key to separating substance from fluff, alpha from noise.
Challenge One: Running Models That Don't Fit on a Single Machine
The core idea is to build a GPU cluster (a swarm) and use pipeline parallelism to serve the models users actually want. Simply put, each node holds only a small slice of the model weights plus its own portion of the KV-cache. Each slice is small enough to fit on a consumer-grade 3090/4090 GPU, or on a higher-spec H100. Combine enough nodes, and you can host large models like GLM.
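To make the slicing concrete, here is a minimal sketch of how a coordinator might hand out contiguous layer ranges to nodes. Everything in it is an assumption for illustration: the `Node` and `assign_layers` names, equal-sized layers, the VRAM figures, and a flat KV-cache reserve. It is not the scheduler of any project discussed here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    vram_gb: float  # free GPU memory the node advertises (assumed figure)


def assign_layers(nodes, n_layers, layer_gb, kv_reserve_gb=4.0):
    """Greedily give each node a contiguous run of layers that fits its VRAM.

    Returns a list of (node_name, first_layer, last_layer_exclusive).
    Raises ValueError if the swarm cannot hold the whole model.
    """
    plan = []
    next_layer = 0
    for node in nodes:
        if next_layer >= n_layers:
            break
        # Hold back a flat reserve for the node's share of the KV-cache.
        usable = node.vram_gb - kv_reserve_gb
        fit = int(usable // layer_gb) if usable > 0 else 0
        if fit == 0:
            continue  # node too small to hold even one layer
        end = min(next_layer + fit, n_layers)
        plan.append((node.name, next_layer, end))
        next_layer = end
    if next_layer < n_layers:
        raise ValueError("not enough aggregate VRAM for the model")
    return plan


# Hypothetical swarm: two consumer cards and one data-center card.
swarm = [Node("rtx3090-a", 24), Node("rtx4090-b", 24), Node("h100-c", 80)]
plan = assign_layers(swarm, n_layers=56, layer_gb=2.0)
```

With these assumed numbers the two 24 GB cards take 10 layers each and the H100 takes the remaining 36. A real scheduler would also have to weigh link quality and node reliability, not just memory.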
Petals proved the feasibility of this approach as early as 2022 with a BitTorrent-style swarm running BLOOM-176B on consumer GPUs, but the speed was only about 1 token per second. Clearly, that speed was unusable, so subsequent innovation focused on making models run faster.
The truly fatal bottleneck is the network. Within a data center, GPUs communicate via NVLink at TB/s speeds; over the public internet, round-trip latency (RTT) can be tens of milliseconds. The decoding process is sequential, so a naive swarm pays the network round-trip cost for every token generated.
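A back-of-the-envelope model shows why the network dominates. The sketch below assumes a simple pipeline in which every token crosses every stage in order, with one network hop per stage; the function name and all the numbers are illustrative assumptions, not measurements.

```python
def naive_tokens_per_second(n_stages, hop_latency_ms, compute_ms_per_stage):
    """Decode rate when each token must traverse every pipeline stage in turn."""
    # One token costs n_stages compute steps plus n_stages network hops
    # (counting the hop that returns the result to the client).
    per_token_ms = n_stages * (hop_latency_ms + compute_ms_per_stage)
    return 1000.0 / per_token_ms


# Assumed figures: 8 stages, 25 ms per internet hop, 5 ms of compute per stage.
internet_rate = naive_tokens_per_second(8, hop_latency_ms=25, compute_ms_per_stage=5)
# Same pipeline with hop latency treated as negligible (data-center case).
local_rate = naive_tokens_per_second(8, hop_latency_ms=0, compute_ms_per_stage=5)
```

Under these assumptions the internet swarm manages roughly 4.2 tokens per second against 25 for the same pipeline with negligible hop latency. The compute is identical in both cases; the gap is pure network.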
The most common solution is speculative decoding: a small, cheap draft model proposes K candidate tokens first, and the large sharded model verifies these K tokens in a single pipeline pass, then keeps the longest matching sequence. This way, one expensive network traversal yields several tokens, not just one.
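The mechanism is easier to see in code. This is a toy sketch of one round of greedy speculative decoding, with both models reduced to plain functions that map a token list to the next token. The function names and toy models are assumptions for illustration. Two details go beyond the description above but are standard in the technique: on a mismatch the large model's own token replaces the rejected one, and when all K drafts match the same pass yields one bonus token.

```python
def speculative_step(target_next, draft_next, context, k):
    """One round of greedy speculative decoding.

    target_next / draft_next: functions mapping a token list to the next token.
    Returns the tokens accepted this round (always at least one).
    """
    # 1. The draft model proposes k tokens cheaply and locally.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target model checks all k positions. In a real swarm this is a
    #    single batched pipeline pass; here we loop for clarity.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(tok)

    # 3. All k drafts matched: the same pass also yields one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy models (assumed): the target counts upward mod 10; the draft agrees
# except that it guesses wrong right after seeing a 3.
def target(ctx):
    return (ctx[-1] + 1) % 10


def draft(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens_with_miss = speculative_step(target, draft, [0], k=4)
tokens_all_match = speculative_step(target, draft, [5], k=2)
```

In the first call the draft gets three tokens right and misses the fourth, so the round still yields four tokens for one network traversal. The payoff depends entirely on how often the draft model agrees with the large one.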
Currently, swarms running over real-world internet links reach about 30-40 tokens per second. That is significant progress, but it has not yet been validated at scale or at the speeds users truly need. This is a problem that demands serious engineering.
Note: Serving Inference is More Than Just Raw FLOPs
A common trap when comparing any swarm method to cloud-hosted models is focusing only on tokens per second, assuming that's the whole story.
But production-grade inference must get many things right that have little to do with raw compute power:
- Balancing Time to First Token (TTFT) and inter-token latency
- Prefill vs. decode phases (with completely opposite hardware needs)
- Placement and transfer of the KV-cache
- Streaming, continuous batching, and utilization under mixed loads
- Long-context behavior, cold starts, and model warm-up
- Node churn
Due Diligence Point: When a project cites throughput numbers, always ask what they're competing against. Centralized deployments using vLLM or SGLang (with disaggregated prefill and continuous batching) are the real benchmark, and this benchmark gets faster every quarter. "We achieved 30 tokens per second over the internet" sounds impressive, but may still lack competitiveness.
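One way to pressure-test a headline number is to compute the latency profile yourself from token arrival times. The sketch below assumes you have logged when the request was sent and when each streamed token arrived; the function name, field names, and timestamps are hypothetical.

```python
def latency_profile(request_sent, token_arrivals):
    """Summarize a streamed response from wall-clock timestamps (in seconds)."""
    ttft = token_arrivals[0] - request_sent
    gaps = [b - a for a, b in zip(token_arrivals, token_arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_arrivals[-1] - request_sent
    return {
        "ttft_s": ttft,                        # time to first token
        "mean_itl_s": mean_itl,                # mean inter-token latency
        "end_to_end_tokens_per_s": len(token_arrivals) / total,
    }


# Assumed trace: first token after 2 s, then one token every 25 ms.
profile = latency_profile(0.0, [2.0, 2.025, 2.05, 2.075, 2.1])
```

In this made-up trace the decode phase alone runs at 40 tokens per second, yet the user waits two seconds for anything to appear, and the end-to-end rate for the short reply is under 3 tokens per second. A single throughput figure hides that.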
Challenge Two: Proving You Actually Got the Model You Paid For
If you don't trust the node, how do you know it actually ran the claimed model and didn't secretly swap it for a cheaper quantized version? Especially in networks involving mining tokens, it's easy for providers to "play games," ostensibly serving the actual model while running something cheaper.
Currently, there are five mainstream approaches:
- ZKML: Zero-knowledge proofs for forward passes. Cryptographically bulletproof, but overhead is ~10,000x native. Generating one token for a Llama-3 model takes about 150 seconds. Impossible at frontier scale in the short term.
- opML: Outputs come with a bond, opening a challenge window, with fraud proofs bisecting disputes to one step, re-run by an arbiter. Near-native speed, but finality requires waiting for the window, and there's a "verifier's dilemma" (if verification costs more than the value of catching cheating, no one verifies).
- Deterministic re-execution: Make inference byte-for-byte reproducible; disputes only need to check if bytes are equal. Overhead less than 2%, backed by restaked ETH.
- Statistical fingerprints: Cheaply hash or sample computations, catching most cheating most of the time. Not absolutely correct, but fast and suitable for heterogeneous GPUs, which a permissionless swarm needs.
- Live-weight proofs: Directly sample the tensors actually residing in the service runtime, comparing them against a manifest of the approved model. Verifies "what was loaded," not "what was output." Overhead is only about 0.1%. This is a truly different approach.
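To illustrate the last approach, here is a minimal sketch of a weight spot check: the model publisher hashes fixed-size chunks of the approved weights into a manifest, and a verifier later samples random chunks from what a node actually has loaded. This is an assumed toy design, not the scheme of any project named in this article; the function names, the chunking, and the use of SHA-256 over float32 bytes are all illustrative.

```python
import hashlib
import random
import struct


def chunk_digest(weights, start, length):
    """SHA-256 over a slice of weights, packed as little-endian float32."""
    chunk = weights[start:start + length]
    return hashlib.sha256(struct.pack(f"<{len(chunk)}f", *chunk)).hexdigest()


def build_manifest(weights, chunk_len):
    """Publisher side: digest of every fixed-size chunk of the approved model."""
    return [chunk_digest(weights, i, chunk_len)
            for i in range(0, len(weights), chunk_len)]


def spot_check(loaded_weights, manifest, chunk_len, nonce, n_samples):
    """Verifier side: compare randomly chosen loaded chunks with the manifest."""
    rng = random.Random(nonce)  # a fresh nonce makes sampled positions unpredictable
    for _ in range(n_samples):
        idx = rng.randrange(len(manifest))
        if chunk_digest(loaded_weights, idx * chunk_len, chunk_len) != manifest[idx]:
            return False
    return True


# Toy "model" of 64 weights, and a crudely rounded copy standing in for a
# cheaper quantized swap (assumed data).
approved = [i * 0.5 for i in range(64)]
swapped = [float(round(w)) for w in approved]
manifest = build_manifest(approved, chunk_len=8)

honest_ok = spot_check(approved, manifest, chunk_len=8, nonce=1, n_samples=8)
cheat_ok = spot_check(swapped, manifest, chunk_len=8, nonce=1, n_samples=8)
```

Because only a handful of chunks are hashed per challenge, the cost stays small relative to serving. The hard part in practice, which this sketch ignores, is making sure the sampled bytes really come from the runtime that is serving requests.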
The real-world trade-off is that you can have only two of these three at once: cryptographic integrity, low latency, and cost efficiency. ZKML gets integrity but sacrifices latency and cost; the other methods get latency and cost but can offer only economic or statistical integrity.
Due Diligence Point: Ask which method a project uses, why, and what this trade-off means for the end product.
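The fraud-proof bisection that opML relies on is essentially a binary search over execution traces. The sketch below assumes each party commits to a list of per-step state hashes, that both agree on the starting state and disagree on the final one, and that a divergence persists once it appears. The function name and traces are hypothetical.

```python
def first_divergence(prover_trace, challenger_trace):
    """Binary-search the first step where two execution traces disagree.

    Assumes the traces agree at index 0, disagree at the last index, and
    stay diverged once they diverge (each state depends on the previous one).
    """
    lo, hi = 0, len(prover_trace) - 1  # invariant: agree at lo, disagree at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prover_trace[mid] == challenger_trace[mid]:
            lo = mid
        else:
            hi = mid
    return hi  # the arbiter only needs to re-run the single step lo -> hi


# Assumed traces: 16 steps, with the prover cheating from step 11 onward.
honest = list(range(16))
cheating = honest[:11] + [x + 100 for x in honest[11:]]
disputed_step = first_divergence(cheating, honest)
```

The arbiter's work grows with the logarithm of the trace length rather than the length itself, which is why this family of schemes can run at near-native speed. It does nothing for finality, though: the challenge window still has to elapse.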
Challenge Three: How to Truly Keep Prompts Private?
Proving output correctness is a completely different problem from hiding input. In a sharded swarm, each node must decrypt activations in order to compute on them, so encryption protects only the data in transit, not the data on the node itself.
Transformer activations are actually very easy to reverse-engineer. A CCS 2025 paper showed over 90% accuracy in reconstructing input prompts from intermediate activations. The "Hidden No More" paper from ICML 2025 achieved near-perfect recovery and defeated the noise-and-permutation defense commonly used in swarms.
The only robust fix currently is a heavier sequence-sharded scheme, which no one in the consumer-GPU camp has truly launched yet, so this remains a largely unsolved problem.
A swarm can claim "no node holds the entire model," yet still leak every prompt to any node along the path. "No node holds the model" was never a privacy property.
What can genuinely provide privacy is hardware or mathematics, not network topology. TEEs (Trusted Execution Environments)—like Phala's solution on GPUs, Darkbloom's on Apple silicon, Venice's Pro mode—shift trust to a hardware root and provide attestation.
Fully Homomorphic Encryption (FHE) directly computes on ciphertexts, trusting nothing, but the cost for large models is currently unacceptable.
Due Diligence Point: A project either genuinely has one of these schemes, or it doesn't have privacy, no matter how the landing page is worded.
Important Reminder: "Private" does not equal "trustless." TEEs don't eliminate trust; they just shift it from node operators to hardware vendors, the firmware chain, attestation services, and the enclave implementation.
The real question is: Whose root of trust are you willing to accept? The chip maker? A set of restaked validators? A TEE network? Or pure mathematics?
Challenge Four: How to Build a Real Two-Sided Market?
The first three are technical challenges; the fourth is a business challenge.
For a decentralized inference network serving open-weight models, who is the ideal customer profile (ICP)?
Most ordinary consumers currently get tremendous value from subscription plans—lots of intelligence for $20-200 per month. These subsidized plans may disappear or become limited in the future, but it's very difficult to win over consumers today with pay-per-use inference APIs.
Enterprises won't be big buyers in the short term either. This may change in the future, but don't count on it soon.
That leaves two real user categories: 1) Startups and businesses embedding inference into their own product stacks, who naturally need API plans; and 2) Autonomous AI agents seeking their own inference capabilities.
The startup category is a growing market, a niche where significant revenue might be captured, but there's a clear near-term cap on value capture. AI agents as buyers are more speculative—someone still needs to pay for them in the short term.
Here's the dilemma: How do you aggregate meaningful supply of the models people actually want, when the target user group is unlikely to be big spenders on the network?
The only viable source of supply today is decentralized GPU providers. Projects like io.net, Akash, Render, Aethir, and Nosana have been doing this for years, renting out whole GPUs, or nodes that each host an entire model, to paying customers via token-coordinated markets. There is precedent.
Due Diligence Point: Ask about the project's ICP and how they plan to acquire target users while also keeping the supply side satisfied. If everything is built on expectations of speculative token appreciation, that's a clear warning sign.
Who's Really Solving These Challenges? A Rundown of Major Projects
There are many projects currently categorized under "decentralized inference," but most don't address all four challenges equally; they have different focuses.
Petals: The absolute pioneer in decentralized inference. In 2022, proved BLOOM-176B could run BitTorrent-style on consumer GPUs. Conceptually significant but didn't solve incentives, privacy, or monetization. Any project that's essentially "Petals architecture + token" is likely larping.
Dolphin Network: The team behind the Dolphin series of uncensored open models (over 5 million downloads on Hugging Face). The project started from real user demand and built the network afterward. Technical highlights include live-weight proofs (0.1% overhead), layered with logprob fingerprints, software integrity checks, and account-level bonding. Has generated over 3.2 billion tokens, with sustained throughput of ~9400 t/s. A product-first representative with strong execution.
Inference.net (formerly Kuzco): One of the most mature attempts at verifying models in the wild. Unique LOGIC mechanism uses logprob statistical tests to catch model swaps. Has been in production for ~18 months, fleet size in the thousands of GPUs. One of the few projects with both verification primitives and real operational history.
Morpheus: A decentralized routing and rewards layer, providing an OpenAI-compatible API + smart agent wrappers. Technical highlight is TEE-backed provider verification (Intel TDX + NVIDIA GPU attestation live). Worth monitoring: MOR emissions and evidence of real external demand.
Chutes (Bittensor subnet 64): User-side is an OpenAI-compatible API, backend is Docker-packaged chutes deployed to Bittensor GPU miners. Has clear advantages in distribution and scale, but still lags in verification and privacy.
c0mpute: A new Solana-native project. Its Shard engine splits frontier models across consumer GPUs. Has public demos for GLM-5.2 744B and gpt-oss-120B (30-40 t/s). Technical artifacts are verifiable, but still extremely early (repo went live days ago, founder anonymous, token is pump.fun micro-cap).
Parallax (Gradient Network): A P2P distributed LLM inference framework supporting pipeline-parallel sharding across consumer GPUs and Apple Silicon, enabling individuals or small orgs to run "sovereign clusters." Strong institutional backing (Pantera and Multicoin led a $10M seed round), but privacy scheme unclear.
Darkbloom: Allows users to turn idle Mac compute into a private inference marketplace. Each Mac runs the full model, with privacy guaranteed via Secure Enclave attestation. Doesn't take the sharded swarm route; attestation stack is rigorous. Moved from research preview to public alpha; real traction worth watching (decentralization doesn't necessarily require tokenization).
MeshLLM: A permissionless P2P inference mesh introduced by Jack Dorsey and built by a team associated with Block. Uses Nostr for node discovery, no central servers, closer to BitTorrent than Bittensor. Protocol-first, no token, censorship-resistant.
Venice and Its Reseller Ecosystem: The exemplar for the entire field in searching for PMF and a viable business model. It is itself a centralized but privacy-tiered consumer proxy, having effectively solved some challenges. A sub-ecosystem of resellers like UsePod, AntSeed, Surplus Intelligence has formed around it, primarily doing demand aggregation and settlement, not directly providing decentralized compute.
The Battleground for Decentralized Inference
A cost advantage exists only once you separate latency from throughput. They are two different products; decentralization is a tax for one and a feature for the other.
Scenarios where Centralized Clearly Wins (decentralization is a tax): ChatGPT-style interactive chat, real-time coding agents, low-latency voice, high-frequency tool calls, enterprise-grade strict p95 latency SLAs, competitively low-latency service for dense frontier models.
Scenarios where Decentralized Might Win (supply aggregation advantage): Synthetic data generation, offline evaluation, batch embeddings, batch RAG, long-running agent research tasks, image/video generation queues, non-urgent open-model inference (where marginal cost of idle hardware approaches zero).
Simple Framework: When latency matters, decentralization is a tax; when throughput matters, decentralization can be a supply aggregation advantage.
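The asymmetry comes from the fact that throughput adds up across nodes while per-request latency does not. A rough sketch, assuming a perfectly parallel batch job and made-up figures (the function name and all numbers are illustrative):

```python
def batch_hours(total_tokens, n_nodes, tokens_per_s_per_node):
    """Wall-clock hours to finish an embarrassingly parallel batch job."""
    aggregate_rate = n_nodes * tokens_per_s_per_node  # throughput is additive
    return total_tokens / aggregate_rate / 3600


# Assumed job: 1 billion tokens of synthetic data on nodes doing 30 t/s each.
one_node = batch_hours(1e9, n_nodes=1, tokens_per_s_per_node=30)
swarm = batch_hours(1e9, n_nodes=200, tokens_per_s_per_node=30)
```

Under these assumptions, 200 slow nodes finish in about 46 hours what one node would need over a year to do, and nobody in that workflow cares that each individual node is far too slow for interactive chat.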
The Hidden Long-Term Value: Data Loop
Decentralized inference networks can also collect vast amounts of valuable data—synthetic training data, preference data, agent traces, evaluation outputs, fine-tuning data, RL environments, tool usage traces, etc. This data can feed back into decentralized training networks (like Nous Psyche, Prime Intellect, Gensyn-style projects), producing updated open-weight models that flow back into inference networks.
In the long run, this isn't a separate bet on "decentralized training" or "decentralized inference," but a closed loop: Inference generates traces → Traces become training data → Training updates models → Updated models flow back to inference.
The best projects will treat this loop as a core strategy, and training and inference projects will further merge in the future.
Practical Due Diligence Checklist: Just Answer These Seven Questions
- Is it truly decentralized? Specifically at which layers? (Many just slap the label on because they have a token)
- Can you trust the output came from the model you paid for? (Deterministic, proofs, fingerprints, or nothing?)
- After deducting token and coordination overhead, is it genuinely cheaper than centralized? (In production, not in theory)
- Are prompts truly hidden from operators? (Only TEE/FHE count; mere sharding does not)
- Can the system run reliably when nodes are unreliable and scattered across the internet?
- Is anyone actually paying, and for something they can't get cheaper centrally?
- Does the team possess genuine AI technical capability? (The most important one)
Extra Tip: Be wary of "elegant technical solutions" without credible distribution plans.
My Final Judgment
I'm generally bearish on categories that only appeal to crypto-natives (TAM seems limited in my view). I prefer to see projects that also appeal to non-crypto users, hiding the crypto mechanisms in the background.
Decentralized inference is one of the few tracks in crypto with genuine breakout potential: everyone needs inference, and it can be served the same way traditional providers serve it, even seamlessly through platforms like OpenRouter. The key is cost, performance, and privacy.
Advice: Support projects that can clearly articulate which layer they've decentralized and who their buyer is. Avoid projects that just use "decentralized AI" as a slogan, followed by a coin.
Disclosure: The original author holds tokens in some projects mentioned. They were not influenced or compensated by any project, and judgments are personal opinions.






