AI Investors' 2026 Anxiety: When Models Devour Everything, What Moat Is Left for Startups?

marsbit, published 2026-06-11, updated 2026-06-11

Summary

In 2026, a wave of investor anxiety questions the defensibility of AI startups as models improve, fearing that most companies are just "thin wrappers" destined to be absorbed by foundation models or chipmakers. The author argues against this despair, positing that true moats lie not in benchmark performance but in areas models cannot easily reach. The logic of despair is that if models excel at all measurable tasks, only compute and cutting-edge model weights hold lasting value. However, the essay contends that the most valuable work is inherently "untrainable." Benchmarks measure what can be measured and thus optimized for, but real-world correctness often resides in private, complex systems. Examples include legacy codebases, intricate legal transactions, or hospital workflows. This kind of correctness is proprietary, costly to establish, and cannot be validated quickly—it requires time and trust within an organization. As models commodify visible, measurable tasks from both above (labs absorbing scaffolding) and below (saturation by cheaper models), value shifts to "untrainable ground." This encompasses work where correctness is a private truth, locked behind integration barriers, licenses, liability frameworks, and entrenched user habits. Trust and adoption are slow, human-centric processes that smarter models cannot accelerate. Successful companies defend their position by embedding deeply into client operations, owning the definition of "good" within a specific domai...

Author: Sarah Guo

Translation: Deep Tide TechFlow

Deep Tide Introduction: When large models begin to crush humans on all leaderboards, investors are falling into a kind of despair: what is worth investing in besides Anthropic and Nvidia? This top Silicon Valley investor explains with data and case studies that the real moat isn't on the leaderboard—it's hidden in places that cannot be measured by benchmarks.

In mid-2026, the investor version of AI psychosis is despair: there's nothing worth investing in, so we should just put all our money in Anthropic and Nvidia and go home.

I've never felt this way. I'm already convinced the model is several minor versions smarter than me, I'm happy to buy Anthropic and Nvidia at market price, and all my smartest friends are fairly convinced self-improvement will succeed soon—but I still don't feel this despair.

This despair is not stupid. The logic is this: if models keep getting better at everything, then every company built on them is just a thin layer of wrapping, waiting to be absorbed, and the only value that survives is compute power and frontier weights.

Take software as an example, the case despair theorists rely on most. When Devin launched in 2024, it could solve only 13% of tasks on standard software benchmarks and was basically ignored. A year and a half later, the best agents score in the 80s, and they're doing real work inside Goldman Sachs and the U.S. Army. Almost everyone draws the same wrong lesson: the model ate software engineering. But as the model devours the most measurable parts of software engineering, we're rediscovering what many teams have long known: engineering has always resisted measurement, and the easiest part to measure may not be the only important part.

MIT's Mert Demirer and his collaborators finally put numbers to it: among over 100,000 developers, the latest coding agent increased the amount of code written by about 180%, while the amount of code actually shipped increased by about 30%. Writing code got cheaper. The remaining parts still have to go through people, and they matter. Of course, the net impact is still staggering.
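To make the gap concrete, here is a back-of-the-envelope sketch in Python. It assumes both percentages are measured against the same pre-agent baseline, which the passage above does not state, and the function name is illustrative rather than anything taken from the study.

```python
def shipped_per_written_ratio(written_increase: float, shipped_increase: float) -> float:
    """Return the new shipped-per-written fraction relative to the old one.

    Both arguments are fractional increases over the same baseline
    (an assumption): 1.80 means "+180%", 0.30 means "+30%".
    """
    return (1 + shipped_increase) / (1 + written_increase)


# Figures quoted in the text: about +180% code written, about +30% code shipped.
ratio = shipped_per_written_ratio(1.80, 0.30)
print(f"{ratio:.2f}")  # 1.3 / 2.8
```

Under that assumption, each line written is a little less than half as likely to ship as before, which is one way to read "the remaining parts still have to go through people."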

Benchmarks are things you can measure, and what you can measure is what you can train for. That is why coding agents matured first: compilers are free verifiers, and so are test suites. When the answer checks itself for free, you can grind against the check until you beat it. But passing the tests never tells you whether a change is the right one for a decade-old codebase with three modules whose reason for existing is undocumented, and a deployment pipeline held together by a cron job nobody will admit to writing.

That kind of correctness cannot be read from a leaderboard, and in fact cannot be read from anything. You learn by running in the real world long enough to discover whether such a complex system works, and smarter models don't make the world run faster. Nobody runs unit tests on something Google-scale and trusts the green check; you trust it because it has weathered years of real load. Such correctness isn't just private, it's the slow kind of moat that capital can't bulldoze. Even optimists admit the clock cannot be skipped: Noam Brown, pioneer of reasoning models at OpenAI, recently wrote that the only reliable way to evaluate an agent over a year-long timescale might just be... to run it for a year.

As Gabe Pereyra says, true automation is not just the model getting better. It's the product, model, workflow, and company all moving together, and three of those four move at the speed of organizations.

The people-moving part is what benchmarks don't touch: getting a skeptical partner to change how she handles matters, keeping the team together through the rebuild. That's why when we hire a CEO, the ability to handle people matters at least as much as analytical skill, and smarter models won't change that weighting. Feedback is fuzzy, the timescale is years, trust is personal. Every company I know gave every engineer access to frontier coding models, but none of them changed their engineering org anywhere near that speed. Adoption took a quarter (and what a magical quarter of token growth that was), but the rebuild is taking years.

What's visible is what's leaving. Valuable work is structurally invisible: anything you can put on a leaderboard, you can train against, so anything measurable is already on its way to commoditization. This process takes time and never fully completes, but the direction never reverses. In the monetary terms of my friend Matt MacInnis at Rippling: tokens spent answering generic questions are almost worthless because anyone's model can answer them, while tokens spent reasoning over your company's data are worth much more because they do what you actually want, not just what looks plausible.

Visible work gets eaten from two sides. From below, task saturation: once a job can be cheaply checked, buyers stop asking which model did it and start asking how much it costs, and the job falls to the cheapest open-source or distilled model that week. Wherever they can have impact, margins matter in the end. From above, labs are trying to get the model to devour its own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies, all the apparatus that used to wrap the model gets pulled into the weights until the wrapper is the model. That's frontier absorption. Margin pressure cuts the other way too: a general-purpose agent has to be ready for anything, which is expensive, while a focused app can tune a workflow until it runs on a fraction of the token spend, and unlike the lab selling those tokens, it keeps the spread.

So, we can ask two things about any type of work. Is its correctness private and expensive to build, the kind of truth that exists only inside someone's data? Is it isolated, locked inside systems you cannot enter? Contrast these against how saturated the task is, and you get a 2x2 matrix. Saturated work with public answers is commodity tokens, owned by open-source models. Frontier work with public answers, where coding benchmarks live, is where labs win, because when evaluation is free, owning it costs nothing. The prize is in the last corner, the untrainable one: frontier work whose correctness exists only in private domains. You can see this in the inference clouds hosting AI-native pioneers, where the vast majority of tokens are generated by bespoke models, not general-purpose open-source ones.
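The matrix can be written down as a small lookup. This is only an illustrative sketch: the enum and function names are hypothetical, and because the essay names just three of the four corners, the fourth is left explicitly unlabeled rather than guessed.

```python
from enum import Enum


class Quadrant(Enum):
    COMMODITY_TOKENS = "saturated work, public answers: commodity tokens (open-source models)"
    LAB_TERRITORY = "frontier work, public answers: where labs win"
    UNTRAINABLE = "frontier work, private correctness: the untrainable corner"
    UNNAMED = "saturated work, private correctness: not characterized in the essay"


def classify(saturated: bool, private_correctness: bool) -> Quadrant:
    """Place a type of work on the essay's 2x2 (saturation vs. private correctness)."""
    if private_correctness:
        return Quadrant.UNNAMED if saturated else Quadrant.UNTRAINABLE
    return Quadrant.COMMODITY_TOKENS if saturated else Quadrant.LAB_TERRITORY


# A public coding benchmark at the frontier: answers are public, task not yet saturated.
print(classify(saturated=False, private_correctness=False).value)
# Work inside a bank's production system: frontier task, correctness known only inside.
print(classify(saturated=False, private_correctness=True).value)
```

The two boolean inputs are a simplification. The essay treats both saturation and privacy as matters of degree, with walls that "vary in height."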

The walls into that last corner vary in height. A single developer's toy codebase is portable and standardized, so the climb is short. A bank's production system is neither, and you don't get root access by being 2% smarter on SWE-Bench Verified.

Capability eats many things, but a better model does not make a private ground truth public. It doesn't hold the license, sign the liability, or own the firm's documents, and it cannot be the party sued when the answer is wrong. Intelligence isn't the bottleneck here. Licensing is, and so is liability. You can imagine a model far smarter than anyone, and it still must be allowed through the door, and someone still must sign their name to what it does.

That door has a lock and a bolt. The lock is context: you only get to validate whether the AI did useful things after being trusted inside the system, after security reviews, integration, the contract where you sign your name to the result. The bolt is the user. Most doctors in the U.S. now open OpenEvidence every day, and no amount of compute can buy that. A lab could train a perfect medical model tomorrow and still fail to get into a doctor's habit, or into UCSF's decision flow, because trust is built slowly, on relationships. It requires the user's acquiescence, which gradient descent cannot produce.

This too is work. An app earns its place in the untrainable corner by doing unglamorous work: arranging the company's private reality so the model can act on it, giving the model tools to act, working with the customer to change the reality of their employees. A company that brings the translation is hard to copy—and the translation never ends. Integration and maintenance last as long as the relationship, won by teams that put domain-specialist engineers and tools next to the customer.

For example, at a top white-shoe law firm, the M&A practice alone runs nearly a thousand deals a year. For confidentiality and many other reasons, you can't have hundreds of associates each downloading client files to their desktop and asking a general-purpose agent to sift through them, and even if you could, what you'd learn would be piecemeal, one associate's correction at a time, missing how the whole deal flows. The important signal exists at the deal level, and a deal has a shape: for M&A, it's NDA, term sheet, due diligence, purchase agreement, ancillary documents, closing checklist; for IP litigation, it's motions, discovery, prior art, more motions. Each practice area has its own, and lawyers and tools are not interchangeable across them. And the problem the law firm actually solves sits a level above all this: running every practice area in parallel, like a top partner running hundreds of matters at once while onboarding new ones and training associates. Transforming such a law firm is not a single task you can write an eval for. It requires an operator to do what the analytics firm does, with incredibly fuzzy goals, incomplete feedback, long timescales, in an environment that won't stay still.

Unfortunately, invisible value is also hard to sell, for the same reason it's hard to commoditize: companies can't tell from the outside whether AI will transform their ops, just as a benchmark can't tell. So the strongest enterprises stop trying to prove it from the outside and go inside, pricing on outcome. Sierra charges when its agent resolves a customer issue, not when it kicks it to a human, so price becomes the evaluation, which only works because Sierra owns the definition of "resolved." Cognition's Devin does the same move in software, offering a "performance guarantee," which you can only give for outcomes in systems you are trusted inside.

Even serving tokens, the layer everyone loves to call a pure commodity, doesn't act like one. The best AI-native companies concentrate their serving on one or two providers (Baseten or Fireworks) because per-token cost commoditizes on schedule, while reliability at real traffic and guaranteed access to scarce compute do not. Where you serve is a separate choice from which models you use. Price is the only part of inference that acts like a commodity.

A common objection raised is that the lab is your supplier—why wouldn't it run its own first-party product below cost to bleed you dry, or revoke your API access and take the market itself? This is the real version of despair theory, and it only works if the model layer is a single-player game. It's obviously not—it looks more like a deathmatch between three and a half parties, with a pack of international players six months behind on training, and a G League five times the size of last year's. Customers want competition among suppliers, and labs want market share more than they want any single app dead.

You can see this in markets where labs go head-to-head. In consumer chat, the best model never simply wins. ChatGPT held the lead for years in real competition, and the share it's losing now is going to Gemini, on the strength of Android and search, not a better model. Anthropic, currently rated by prediction markets (and internet vibes) as having the best model, is barely a factor in consumer chat but built its business in enterprise and coding instead. If a better model cannot take users from a competitor in the most core application, it won't do so across a hospital's records integration or a bank's liability. The public's choice today is not based on coding alone. If the frontier stays crowded, the layer above it will be valuable.

If work can't be scored from the outside, someone inside has to decide what even counts as a good answer, and that decision is the whole game. Enough of those decisions, written down, becomes a benchmark. Harvey released one for law, Sierra for voice agents. You earn the right to define what good means for a domain by being the one that domain already uses, and these companies won that right through the fight of real adoption.

The evaluations that decide real money are private and vary by company: what this firm, on this matter, will accept as good work. That definition is nowhere near done, because the depth of law dwarfs any public test. OpenEvidence is establishing what a safe clinical answer looks like. This is not really measurement. It is judgment about what's true and what's good, written down until it becomes the standard by which everyone else is measured, and the underlying lab, however smart, cannot write it because that status exists only inside the domain. That authority tends to land where it already sits. Senior lawyers write the law benchmark. Defining safe clinical answers falls to doctors. And what resolved means is whatever the company that already has the customer says it means.

The frontier of absorption keeps rising because we keep learning to measure more work, and the measurable gets eaten. The untrainable ground shrinks under the feet of whoever stands on it, so you cannot find a defensible point and rest. You keep moving toward whatever still can't be scored, and you keep re-underwriting. On a narrow task, with your private data and your own evaluations, you can fine-tune to the frontier and beat the general-purpose model where it matters, and that specialized model becomes part of the moat. On the other hand, competing on the general-purpose model is a capital war you lose to whoever has the most compute, and it is the trap for companies with shallow access and visible tasks. Once a company stakes its survival on out-training the frontier on general tasks, the winner seems to be determined mostly by datacenter scale, and the end is usually not an independent champion but a sale to someone compute-rich.

All this is defense. The harder part is offense, choosing what to build in the first place. This is what I spent a year looking for, and I might have found it three times. The model doesn't help here. It will do whatever you point it at, but it can't tell you what's worth pointing at. You cannot benchmark that, so you cannot train for it. This is also why incumbents won't take everything: they hold the ground they have, and the next thing comes from whoever spots a use before the rest of us. Perhaps intention is a scarcer input than compute.

The despair theory is half right. The thin wrapper layers are indeed being absorbed, and a lot of what looks like a company today is a thin wrapper. It's wrong about what's left. The mechanism is clear; the destination is not. What I'd bet on is the direction: intelligence keeps getting cheaper, and value keeps sliding toward the few places the model cannot reach. The untrainable is value with history. So get into one, do the unglamorous translation, start writing down what good means there, because someone will. This year's most cited benchmark score is a map of territory about to become worthless, and a notice of who's about to lose the right to say what counts as good.

Related Questions

Q: What is the main anxiety among AI investors in 2026, according to the article?

A: The main anxiety is a feeling of despair that there is nothing left to invest in except leading model providers like Anthropic and hardware leaders like Nvidia, because models seemingly commoditize and absorb all value built on top of them, leaving no defensible moat for startups.

Q: Why does the author argue that benchmarks are misleading indicators of a company's defensibility?

A: The author argues that benchmarks measure only what is publicly measurable and trainable. Therefore, any task that can be benchmarked is on a path to commoditization. True defensibility lies in "untrainable ground": private, hard-to-measure work involving integration, domain-specific knowledge, trust, and organizational change, which cannot be captured by a public score.

Q: What two factors create a "wall" protecting valuable, "untrainable" work from being absorbed by general AI models?

A: The two factors are: 1) The "lock" of context: gaining trusted access to a private system requires security reviews, integrations, and contracts, which is a slow process. 2) The "bolt" of the user: establishing user habits and trust within an organization (like doctors using a specific tool) is based on relationships and slow adoption, not just superior model intelligence.

Q: How do leading AI-native companies like Sierra and Cognition change their business models to align with the concept of "untrainable" value?

A: They shift from selling based on inputs (like tokens) to pricing based on outcomes and guarantees. For example, Sierra charges only when its agent resolves a customer's issue, and Cognition's Devin offers performance guarantees for software tasks. This is only possible because these companies have earned the trust to define what "resolved" or "good" means within a specific client context.

Q: What is the author's final investment thesis or recommended direction in the face of AI models becoming universally capable?

A: The author advises betting on moving into areas of "untrainable" value, where correctness is private, expensive to establish, and isolated within specific systems or organizations. The strategy is to do the "unglamorous" work of translation: integrate deeply into a domain, start defining what "good" means within that private context, and build a moat based on relationships, trust, and proprietary workflow orchestration that models cannot easily replicate.
