Claude Opus 4.8 Released, Anthropic Begins to Market "Trustworthiness" as a Product Feature

marsbit | Published 2026-05-29 | Updated 2026-05-29

Summary

Anthropic has released Claude Opus 4.8, emphasizing "trustworthiness" as a core selling point alongside performance gains. While it leads in five of six key benchmarks, the most significant improvement is a dramatic reduction in the model's failure to report its own errors in code tasks, cutting the rate from 19.7% to 3.7%. This increased reliability and self-awareness is positioned as crucial for deploying AI in real-world workflows. Key updates include enhanced mathematical and long-context reasoning, the introduction of dynamic multi-agent workflows in Claude Code for automated verification, and improved token efficiency. However, the model shows some regressions in areas like resisting prompt injection. Pricing remains unchanged. The release also signals the impending arrival of Anthropic's more powerful, restricted "Mythos"-tier model in the coming weeks. The article frames the industry shift as moving beyond pure benchmark competition towards a focus on reliability and verifiability, enabling users to delegate more critical tasks to AI agents with greater confidence.

Editor's Note: Anthropic has released Claude Opus 4.8, which takes first place in five of six core benchmarks while keeping the price unchanged; Claude Code gains dynamic workflows, and a next-generation Mythos-level model is now expected to follow.

Compared to simple performance improvements, what's more noteworthy about this release is that Anthropic is starting to shape "trustworthiness" into a core selling point for cutting-edge models.

In code honesty tests, Opus 4.8 significantly reduced the rate of failing to report its own errors; in Claude Code, it can orchestrate multiple sub-agents and introduce adversarial self-checking before delivering results. These changes collectively point to a practical issue: when AI moves from chat windows into real workflows, what users often worry most about is not the model's inability to complete tasks, but rather its tendency to provide seemingly complete, fluent, and self-consistent answers even when it's wrong.

Therefore, the significance of Opus 4.8 extends beyond a model upgrade; it also sends a clear industry signal: the competition among frontier models is shifting from a pure pursuit of benchmarks to a race for reliability, verifiability, and the ability to expose errors. For enterprise and professional users, the core threshold for AI in the next phase will increasingly depend on whether a model is trustworthy enough to be delegated to.

This is also a prerequisite for Agents to become truly usable. Models need not only to handle more tasks, but also to give people the confidence to entrust them with more important and complex ones.

The following is the original text:

Anthropic today released Claude Opus 4.8. Among the six benchmark tests listed on the release card, it achieved first place in five of them.

The change I'm watching most closely: in Anthropic's code summary honesty test, Opus 4.7 failed to flag its own errors 19.7% of the time; for Opus 4.8, that rate dropped to 3.7%. On the same task, the rate of unflagged errors fell by a factor of about 5.3 (19.7 / 3.7), roughly fivefold. Anthropic summarized this as "4x" in the announcement. However you calculate it, this is the key factor determining whether you can hand real work to this model and walk away with confidence, and it matters more than any single benchmark score on the release card.
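The gap between "roughly fivefold" and "4x" is easier to judge with the arithmetic written out. The sketch below uses only the two rates quoted above and shows three common ways to express the same change; the function and key names are illustrative, and how Anthropic arrived at its "4x" figure is not stated in the source, so it is not reproduced here.

```python
# Rates quoted in the article for the code summary honesty test.
RATE_OPUS_47 = 0.197  # share of its own errors Opus 4.7 failed to flag
RATE_OPUS_48 = 0.037  # same metric for Opus 4.8


def improvement(old: float, new: float) -> dict:
    """Express one before/after change in three different ways."""
    return {
        "ratio": old / new,                      # "N times fewer"
        "relative_reduction": (old - new) / old,  # share of failures eliminated
        "point_drop": (old - new) * 100,          # percentage points
    }


stats = improvement(RATE_OPUS_47, RATE_OPUS_48)
print(f"ratio:              {stats['ratio']:.2f}x")
print(f"relative reduction: {stats['relative_reduction']:.1%}")
print(f"drop:               {stats['point_drop']:.1f} percentage points")
```

The same pair of numbers can honestly be reported as about 5.3x, about an 81% reduction, or a 16-point drop, which is why it helps to check which framing a vendor headline is using.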

What Was Actually Released

First, the simplified version, then the specific numbers:

Reliability has truly improved. Apart from the code honesty data mentioned above, Opus 4.8 is also the first Claude model to achieve "literally zero" on two diligence tests: it reduced the frequency of "misreporting defective results" from 0.25 to 0.00, and dropped the incidence of "lazy investigations" from 25% to 0%. Overconfident erroneous responses decreased by about 11x. The tendency to favor its own work, a measurable bias in 4.7, has disappeared.

Dynamic workflows were added to Claude Code, currently in research preview. Claude now writes its own orchestration scripts, scheduling dozens to hundreds of sub-agents in parallel within a single session, and runs independent adversarial agents that attempt to refute the results before presenting them to you. This is the "agent teams" concept proposed in Opus 4.6, now automated.

It leads on its own release card, but not completely. It won five out of six. GPT-5.5 still leads in terminal operation tasks. Furthermore, the system card reveals some honesty regressions that Anthropic didn't put on the presentation slides, which will be discussed below.

No price changes. Still $5 per million input tokens and $25 per million output tokens, same as 4.7. Fast mode, however, now costs about one-third of what it did before, though it remains a premium tier at $10 / $50 per million input / output tokens.

Mythos is coming. Anthropic explicitly stated that the limited-access, extremely capable Mythos-level model will arrive in the coming weeks. Opus 4.8 is the public gateway to it.

Official Release Card: Benchmark Landscape

[Figure: the official release card, redrawn in our color scheme.]

One category broke the sweep, and it's an important one. On Terminal-Bench 2.1, which tests a model's ability to complete long-horizon agent tasks via terminal, GPT-5.5 still leads with 78.2% versus Opus 4.8's 74.6%. Anthropic placed this loss on its own release card rather than hiding it. The "agent vs. craftsman" distinction we mentioned at GPT-5.5's launch hasn't been fully bridged: GPT-5.5 remains the stronger pure terminal operator, while Opus 4.8 is more like a stronger engineer in most tasks that real professional users care about, such as real-world coding, expert reasoning, computer use, and knowledge work.

Beyond the Release Card

The release card only shows six benchmarks. The 244-page system card reports on 40+ tests, and some of the most interesting results aren't on the slides. The following are noteworthy:

Mathematical ability improved by 27 percentage points. On the USAMO 2026, the United States of America Mathematical Olympiad held this March, Opus 4.8 scored 96.7%, while 4.7 scored 69.3%. Since this competition occurred after Opus 4.8's training cutoff, there's no data contamination issue. This is the largest intergenerational leap on the entire card.

Advantage widens in long-context scenarios. In a million-token graph reasoning test, Opus 4.8 scored 68.1, compared to 4.7's 40.3 and GPT-5.5's 45.4. The longer the context and the harder the task, the more its lead widens.

Multi-agent is where it truly excels. A single Opus 4.8 agent trails Gemini in web research tasks, scoring 84.3 to Gemini's 85.9. However, when an orchestrator schedules a team of sub-agents, its score reaches 88.5%, the highest reported result; a five-agent team can also match the best single-agent score in one-fifth of the time. This is the dynamic workflow feature showing up in benchmarks.

Token efficiency sees a qualitative change. On the hardest coding tests, Opus 4.8 at its lowest effort setting matches what Opus 4.7 achieved at its highest effort setting. In other words, you get the previous generation's peak performance at a lower token cost.

It crossed a threshold no model had crossed before. On Harvey's Legal Agent Benchmark, a task is considered successful only if every single scoring criterion within the task is passed. Opus 4.8 is the first model to rank #1 under this "all-pass" criterion. It passed 89% of individual criteria, but the full task pass rate was only 9.6%, illustrating just how stringent real legal work requirements are.

There are honestly presented regressions. Three things are indeed worse than in 4.7, and Anthropic acknowledges them in the system card. GPQA Diamond, an expert science test, dropped from 94.2 to 93.6. Refusal behavior in computer use scenarios and resistance to prompt injection both regressed, making 4.8 easier to manipulate in agent scenarios. Additionally, in a one-year simulated business test, it ended with only about one-third as much cash as 4.7. None of these appeared on the release card, which makes pointing them out all the more worthwhile.
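The "all-pass" criterion described above for Harvey's Legal Agent Benchmark is worth seeing in miniature, because it explains how a model can pass most individual criteria and still fail most tasks. The sketch below is a toy illustration only: the data is made up, and the `score` function is a hypothetical stand-in, not the benchmark's actual scoring code.

```python
def score(tasks: list[list[bool]]) -> tuple[float, float]:
    """Return (per-criterion pass rate, all-pass task rate).

    tasks[i] holds the pass/fail result of every criterion in task i.
    """
    total_criteria = sum(len(task) for task in tasks)
    passed_criteria = sum(sum(task) for task in tasks)
    # A task only counts as solved if every one of its criteria passed.
    all_pass_tasks = sum(all(task) for task in tasks)
    return passed_criteria / total_criteria, all_pass_tasks / len(tasks)


# Made-up toy data: four tasks with ten criteria each.
toy_tasks = [
    [True] * 10,                # perfect
    [True] * 9 + [False],       # one miss fails the whole task
    [True] * 9 + [False],
    [True] * 8 + [False] * 2,
]

criterion_rate, task_rate = score(toy_tasks)
print(f"criteria passed: {criterion_rate:.0%}")  # 90%
print(f"tasks fully passed: {task_rate:.0%}")    # 25%
```

Even in this small example a 90% criterion pass rate yields only a 25% task pass rate, the same shape of gap as the 89% versus 9.6% reported for Opus 4.8.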

Compared to Open-Weights Models, Where Does It Stand

The release card only compares Opus 4.8 with other closed-source frontier models. Expanding the view to include the cheap open-weights models many teams are testing today paints a picture that almost defines the 2026 AI industry: Opus 4.8 leads in capability, but the gap with free, self-hostable models is only a few percentage points, while the price gap is immense.

The chart above includes a full comparison of eight models. DeepSeek's price reflects its permanent 75% discount; Qwen Max's price hasn't been announced yet.

Opus 4.8 wins outright on coding benchmarks. But Qwen3.7-Max, an open model you can run yourself, scores 60.6, trailing by only about 9 points. DeepSeek V4-Pro scores 55.4, while its output price is roughly one-thirtieth of Opus's. For the highest-risk engineering tasks, Opus's $25 per million output tokens might be worth paying. For a large volume of daily work, that premium is increasingly hard to justify. And that's precisely the calculation every serious team is making right now.
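That calculation can be sketched in a few lines. The example below uses the $25 per million output tokens quoted for Opus 4.8 and treats the DeepSeek V4-Pro price as exactly one-thirtieth of it, which is only the article's rough ratio, not a published price. The monthly token volume is a hypothetical workload, and input-token costs are left out because the source does not give them for the open models.

```python
OPUS_OUTPUT_PER_M = 25.00  # USD per million output tokens (from the article)
# Assumption: "roughly one-thirtieth" taken literally; not an official price.
DEEPSEEK_OUTPUT_PER_M = OPUS_OUTPUT_PER_M / 30


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million


monthly_tokens = 200_000_000  # hypothetical team workload

opus_cost = output_cost(monthly_tokens, OPUS_OUTPUT_PER_M)
deepseek_cost = output_cost(monthly_tokens, DEEPSEEK_OUTPUT_PER_M)

print(f"Opus 4.8:        ${opus_cost:,.2f}")
print(f"DeepSeek V4-Pro: ${deepseek_cost:,.2f}")
print(f"monthly premium: ${opus_cost - deepseek_cost:,.2f}")
```

Whether the premium is worth paying then depends on how much a few benchmark points, and a lower rate of unflagged errors, are worth for the specific tasks in that volume.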

What This Means for You

If you're using Opus 4.7, this is a free upgrade. Same price, better benchmark numbers, and noticeably more reliable judgment of its own outputs. Just switch to it.

The more interesting question is: What work are you now willing to delegate to it? Every reader has a line separating "tasks I can let AI do" and "tasks I must do myself because I can't yet trust the handover." The reliability improvements in 4.8 mean you can push that line forward a step. The model is better at flagging its own uncertainty, which lowers the cost of "silent error handoffs" and expands the range of tasks worth delegating to the model. This is the practical meaning of the honesty data; it's more important than any single score.

This also echoes what we wrote last week. Anthropic's own AI Fluency research found that when model outputs appear polished and complete, people become significantly less likely to notice missing context. The answer looks finished, so we stop checking. Opus 4.8 attacks this failure mode from the model side: it's better at telling you where a seemingly clean, complete answer might still have weaknesses. It can't replace your judgment, but it gives your judgment something to grasp onto.

If you use Claude Code, try the dynamic workflow this week with a genuinely large task—like a migration or a comprehensive check across many files—while keeping an eye on the token meter. This capability is real, and adversarial self-checking is key to making outputs more trustworthy. But the cost is also real. This is a tool for large tasks a single agent struggles with, not your new daily default.
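To make the pattern concrete, here is a minimal sketch of the general idea: fan work out to parallel workers, then have an independent reviewer try to refute each result before anything is reported. It is not Claude Code's implementation. `worker` and `reviewer` are hypothetical stand-ins that do trivial string work instead of calling a model, and one worker is made deliberately wrong so the reviewer has something to catch.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Finding:
    task: str
    claim: int            # what the worker reports
    verified: bool = False


def worker(task: str) -> Finding:
    """Stand-in for a sub-agent: 'measures' the length of the task name."""
    # Deliberate off-by-one error for one input, to simulate a silent mistake.
    claim = len(task) + (1 if task == "billing" else 0)
    return Finding(task, claim)


def reviewer(finding: Finding) -> Finding:
    """Stand-in for an adversarial agent: recompute independently, then compare."""
    independent = sum(1 for _ in finding.task)
    finding.verified = independent == finding.claim
    return finding


def orchestrate(tasks: list[str], max_workers: int = 4) -> list[Finding]:
    """Run all workers in parallel, then review every finding before returning."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        findings = list(pool.map(worker, tasks))
        return list(pool.map(reviewer, findings))


results = orchestrate(["auth", "billing", "search"])
unverified = [f.task for f in results if not f.verified]
print("needs human attention:", unverified)
```

The design point is that the reviewer does not trust the worker's claim and recomputes it by a separate path; the cost is that every task is effectively done twice, which is the token overhead the paragraph above warns about.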

What's Next: Mythos, Coming in Weeks

The most forward-looking statement in this release isn't actually about 4.8. Anthropic stated that the Mythos-level model will arrive in the coming weeks, positioning Opus 4.8 as a public step towards it.

You need to understand what this means. Mythos is a restricted frontier model Anthropic has been benchmarking internally, surpassing the released Opus 4.8 on almost all metrics: 93.9% on SWE-bench Verified; in cybersecurity tests, it can generate runnable exploits against most targets in current browsers, whereas Opus 4.8 has a success rate below 10%. It was previously available to only about 52 vetted institutions at a price five times that of standard Opus, treated as infrastructure rather than a regular product.

Therefore, when an even more powerful Mythos-level model lands in the coming weeks, it should be understood through a "two-market" framework. One market is the commodified layer: Opus 4.8, broadly accessible, price unchanged, and increasingly chased by free open models. The other is the controlled frontier layer: Mythos, expensive and access-restricted. These aren't separate products but different tiers on the same continuum of capability. The reliability work in 4.8 is exactly the foundation that must be in place if your real goal is to run the model with less supervision. And that goal is now weeks away, not quarters.

Background: How We Got Here

If you've lost track of the past four months, think of it this way: Opus 4.6 in February brought agent teams, Sonnet 4.6 brought a price collapse, Opus 4.7 in April brought a reasoning leap, and Mythos has remained the restricted ceiling, dimly visible off to the side. Opus 4.8 connects two of these threads: it continues the orchestration narrative from 4.6 and is also the gateway to Mythos.

This release cadence itself is the key fact hidden beneath all surface changes. The flagship model has gone from 4.5, 4.6, 4.7 to 4.8 in months, and the model you standardize on for your team today might not be the one you're actually running by fall. This is also why, rather than investing in usage tricks for a specific model, it's better to invest in capabilities that can migrate across models, like clear delegation and rigorous verification.

Benchmark sweeps are what get screenshotted and shared. But the place where things are really changing is smaller and more important: this is the first Claude version whose core selling point isn't just "it's smarter," but "you can entrust more to it." Before agents become truly useful, the entire industry must move in this direction; and this part of the capability is also the hardest to fit into a chart.

Where is your line now? What work are you willing to hand to a model, and what still requires doing yourself? And what would need to happen for you to push that line forward another step?

Related Questions

Q: What is the most significant improvement in Claude Opus 4.8 according to the article, beyond benchmark performance?

A: The most significant improvement is its enhanced reliability and trustworthiness. Specifically, in code honesty tests, Opus 4.8 reduced its failure to report its own errors from 19.7% (in Opus 4.7) to 3.7%. This focus on exposing uncertainty and reducing 'silent failure' is framed as a core competitive advantage for professional and enterprise use.

Q: How does the new 'dynamic workflow' feature in Claude Code improve the model's output?

A: The dynamic workflow feature allows Claude Code to automatically write orchestration scripts to schedule and run dozens to hundreds of sub-agents in parallel within a single session. Crucially, it can also run independent 'adversarial' agents that attempt to rebut the primary results before they are presented to the user, adding a layer of verification and self-checking to improve trustworthiness.

Q: What is one key area where GPT-5.5 still outperforms Claude Opus 4.8, and what does this represent?

A: GPT-5.5 still leads in terminal operation tasks, specifically on the Terminal-Bench 2.1 benchmark (78.2% vs. 74.6%). The article suggests this highlights a remaining divide between models excelling at pure 'craftsman'-like terminal operations (GPT-5.5) and those stronger at complex 'engineer'-like tasks such as real-world coding and expert reasoning (Opus 4.8).

Q: What is the 'Mythos' model mentioned in the article, and how is it positioned relative to Opus 4.8?

A: Mythos is a highly capable, restricted-access frontier model from Anthropic that outperforms the publicly released Opus 4.8 on nearly all metrics. It is positioned as a 'controlled frontier' layer (expensive and available to a limited number of vetted organizations), while Opus 4.8 represents the commoditized, widely available layer. Opus 4.8's reliability improvements are seen as foundational steps toward enabling models like Mythos to run with less supervision.

Q: According to the article, what broader industry shift does the release of Opus 4.8 signal?

A: The release signals a shift in frontier model competition from a pure pursuit of benchmark scores to a focus on reliability, verifiability, and the ability to expose errors. For professional users, the next phase's key differentiator will increasingly be whether a model is 'trustworthy enough to delegate to,' which is a prerequisite for AI agents to become genuinely useful in real-world workflows.
