# Benchmark Related Articles

HTX News Center provides the latest articles and in-depth analysis on "Benchmark", covering market trends, project updates, tech developments, and regulatory policies in the crypto industry.

The Recursive AI Anthropic Warned About: Tian Yuandong's New Company Has Just Taken the "First Step"

Anthropic recently highlighted the rapid progress toward "recursive self-improvement," where AI systems autonomously design and train their successors. In response, Recursive Superintelligence, a new company co-founded by former Meta researcher Tian Yuandong, has publicly demonstrated its first step toward automating AI research. The company released a system designed to autonomously execute the full AI research cycle: generating ideas, implementing code, running experiments, and learning from results. It validated this approach by achieving state-of-the-art results on three diverse benchmarks:

1. **NanoChat Autoresearch:** Optimizing a small language model's validation loss under a fixed 5-minute GPU budget, improving upon the community's best result.
2. **NanoGPT Speedrun:** Reducing the time to train a GPT model to a specific loss on 8 H100 GPUs from 79.7 seconds to 77.5 seconds, beating a highly optimized, human-driven community effort.
3. **SOL-ExecBench:** Improving the overall score on NVIDIA's suite of 235 GPU kernel optimization tasks by 18%, closing the gap to the hardware limit. The system discovered novel optimizations in this highly specialized domain without direct human expertise.

Recursive's system operates as a general framework, capable of parallel exploration and cross-task knowledge transfer while incorporating safeguards against reward hacking. The company, backed by $650M in funding and a star-studded team including Richard Socher and Alexey Dosovitskiy, aims to create AI that recursively enhances its own research capabilities. This development represents an early but concrete move toward a new paradigm where AI accelerates its own advancement. It occurs alongside Anthropic's warnings about the need for industry coordination and potential pauses when recursive self-improvement thresholds are reached, highlighting the dual trajectory of rapid technical progress and growing calls for careful stewardship.
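To put the NanoGPT Speedrun figures quoted above in proportion, the relative time reduction can be worked out directly. The following is only a minimal sketch: the helper name `relative_reduction` is hypothetical and does not come from the article or from any benchmark tooling; the two timings are the ones quoted in the summary.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Timings quoted in the summary: 79.7 s -> 77.5 s on 8 H100 GPUs.
speedrun = relative_reduction(79.7, 77.5)
print(f"NanoGPT Speedrun time reduction: {speedrun:.1%}")  # prints 2.8%
```

On a leaderboard that has already been heavily optimized by hand, a reduction of this size is small in absolute terms, which is consistent with the summary's framing of it as an early first step.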

marsbit · 10h ago

"I Don't Need a Better Model Anymore": A Panorama of AI Users Under a Reddit Hot Post

Anthropic recently released Claude Fable 5, its first publicly available "Mythos"-tier model, which scored 80.3% on the SWE-Bench Pro benchmark and significantly outperformed its predecessor and competitors. However, a viral Reddit post titled "Claude Fable made me realize I don't need better models anymore" highlighted a growing user sentiment of "good enough." Top comments expressed "model fatigue," with users stating that earlier models like Opus 4.5/4.8 already sufficed for their workflows. High cost was a key concern: Fable 5's API is nearly twice the price of Opus 4.8, and users questioned the return on investment and suggested the field has hit a plateau. The most frequent complaint targeted Fable 5's stringent safety filters. Designed to intercept high-risk requests (e.g., cybersecurity), the system was perceived as overly conservative. Users reported frequent rejections for routine security-related tasks, leading to automatic fallbacks to the older Opus model. Paying users were particularly frustrated, feeling they had paid a premium for a less usable product. Dissenting voices came from users with heavy, complex tasks. For workloads like high-energy physics simulations with thousands of lines of code, Fable 5's improved long-context understanding and error detection represented a significant, worthwhile leap, described as moving from a "college player to an NBA starter." The debate underscores a divergence between benchmark performance and practical utility. For most users, current models meet their needs, making further advances relevant only for extreme use cases. The discussion also raised concerns about a potential "Public AI Freeze," where the most powerful models (like the restricted Mythos 5) remain exclusive to enterprises and governments while public offerings stagnate. The launch presents two report cards: one of technical excellence and another of user skepticism. Fable 5's ultimate reception may depend on Anthropic's ability to refine its safety filters and justify its cost for specialized, high-demand users.

marsbit · 11h ago

AGI is Just One Step Away

The article discusses Anthropic's release of the Fable 5 model, a heavily restricted version of its powerful Mythos model. Initially unveiled in April, Mythos reportedly identified over 10,000 high-risk vulnerabilities for 50 enterprise clients, causing significant concern. Due to its dangerous capabilities in areas like autonomous cyber-attacks and biochemical weapons design guidance (classified as CB-1 level), the unaltered Mythos 5 remains limited to about 200 vetted entities like government agencies. Fable 5, released with a safety classifier, demonstrates extraordinary performance, leading benchmarks in coding (SWE-Bench Pro), software engineering, and research. It exhibits true "long-horizon agency," autonomously planning and executing complex, multi-step tasks like migrating 50 million lines of code in a day, moving beyond simple question-answering. The article positions Fable 5 at OpenAI's Level 3 ("Agent") and progressing toward Level 4 ("Innovator"), suggesting AGI (Artificial General Intelligence) is within reach, potentially 18-24 months away. To mitigate risks, Anthropic implemented a two-layer safety "cage": a silent routing system that redirects dangerous queries to a weaker model, and a mandatory 30-day data retention policy for all Mythos traffic to detect patterns of malicious use. Despite its high cost ($10/$50 per million input/output tokens), the model targets the enterprise market, where its unparalleled productivity and defensive capabilities against AI-powered cyber threats justify the premium. This signals a market maturation where top-tier AI becomes a strategic, high-value tool for businesses, potentially widening the gap with consumer-focused models and accelerating the rise of "one-person companies" while disrupting labor markets.
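The per-token pricing quoted above ($10 per million input tokens, $50 per million output tokens) translates into per-request cost with simple arithmetic. The sketch below is illustrative only: the function name and the example token counts are assumptions chosen for the calculation, not figures from the article, and it does not call any real API.

```python
# Prices as quoted in the summary, in USD per million tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical workload: 200,000 input tokens and 20,000 output tokens.
cost = request_cost(200_000, 20_000)
print(f"Estimated cost: ${cost:.2f}")  # $2.00 input + $1.00 output = $3.00
```

Because output tokens are priced at five times the input rate, long generated outputs dominate the bill much faster than long prompts do.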

marsbit · Yesterday 05:10

From Hunyuan to WeChat AI: Tencent's Slow-Paced Journey Reaches the Delivery Juncture

On June 8, 2026, WeChat's developer platform announced the internal testing of "WeChat AI," an AI assistant integrated into the WeChat ecosystem. It lets users invoke, access, and operate Mini Programs through natural-language conversation. The platform offers two access modes: an "Automatic Mode," in which developers authorize platform access to their source code for zero-configuration AI operation, and a "Developer Mode" for building custom skills. While the name "WeChat AI" is provisional, this marks WeChat's first step in opening its vast Mini Program ecosystem (over 400,000 developers and hundreds of millions of daily active users) to AI-driven conversational interaction. The move is the latest step in Tencent's deliberate AI strategy, which has progressed from technical R&D and standalone product validation to integration within its super-app. The underlying foundation is Tencent's self-developed Hunyuan large language model. Ranked first domestically in 2025 for application-oriented capabilities such as Agent task execution, Hunyuan emphasizes stability and precision over raw parameter count, which aligns with WeChat AI's need for reliable, low-latency operations involving sensitive tasks like payments and bookings. Earlier consumer-side validation came from "Yuanbao," a standalone AI app whose monthly active users (MAU) surpassed 114 million during the 2026 Chinese New Year red-envelope (hongbao) campaign, though daily activity later subsided. This "pulse growth" highlighted the challenge of user retention for standalone apps and informed the decision to integrate AI natively into WeChat's high-frequency scenarios. However, WeChat AI's "Automatic Mode," which requires source code access, raises developer concerns about code security, data visibility, and liability for AI errors. A deeper, ecosystem-level tension exists between the efficiency of centralized AI task scheduling and the potential "short-circuiting" of merchant pages, which could erode their branding, advertising revenue, and user engagement. As Tencent Chairman Pony Ma noted, balancing centralized AI scheduling with the protection of decentralized merchant traffic is a core challenge. In summary, Tencent's AI path (the stable Hunyuan base model, the user-validated Yuanbao app, and the newly testing WeChat AI integration) is logically coherent. The success of WeChat AI now hinges on resolving developer trust, establishing fair ecosystem rules for merchants, and ensuring operational reliability to gain user confidence for deep, transactional use.

marsbit · 06/08 10:23

Valuation Surpasses 200 Billion, Kimi Reportedly Raises 13.6 Billion More, Speeds Up Hong Kong IPO

Beijing-based AI unicorn Moonshot AI (Kimi) is reportedly in talks for a new funding round aiming to raise up to $2 billion (approximately RMB 13.6 billion), targeting a post-money valuation of $30 billion (approximately RMB 203.5 billion). If successful, this would mark its third round in six months and a six-fold increase from its $4.3 billion valuation in December last year. Last month, the company completed a $2 billion funding round led by Meituan Longzhu, reaching a valuation exceeding $20 billion. According to reports, Moonshot AI has raised over RMB 37.6 billion across six rounds, making it the most heavily funded large language model startup in China. Founded in 2023 by CEO Yang Zhilin, the company's core product is the Kimi AI Assistant. In April, it launched and open-sourced its flagship model, Kimi K2.6, which has demonstrated performance comparable to top models like GPT-5.4 on certain benchmarks. Recently, it began beta testing Kimi Work, a local AI agent for knowledge workers. Commercially, the company's Annual Recurring Revenue (ARR) reportedly surpassed $200 million in April. Regarding its IPO plans, Bloomberg reported in March that Moonshot AI is preparing for a listing in Hong Kong, though the process remains in its early stages. The funding and IPO pace for leading Chinese AI firms has accelerated notably in 2026, mirroring global trends where companies like OpenAI and Anthropic are also setting new fundraising and valuation records. Securing substantial capital is becoming a critical factor in the competitive landscape alongside model capabilities.

marsbit · 06/08 07:45

Just Now, Chinese AI Enters Top 2 in Global Programming, Only Claude Remains Ahead

Today, Alibaba's Qwen3.7-Max scored 1541 on the Code Arena benchmark, placing fourth globally and surpassing top models like GPT-5.5 and Gemini 3.5 Flash. It is the only non-Claude model among the top positions, which puts it second overall behind only Anthropic's Opus models. Before this official ranking, Qwen3.7-Max had already gained recognition overseas. In practical tests, it outperformed rivals on tasks like creating a self-training Tetris AI and generating complex 3D models, often at a significantly lower cost. Developers praised its ability, especially when integrated with tools like Hermes Agent and OpenCode, to effectively replace models such as GPT-5.5. In a hands-on challenge to create a 3D racing game from a detailed prompt, Qwen3.7-Max delivered a fully playable HTML file on the first attempt, requiring only minor bug fixes. It was the only model to include a start menu and sound effects, details the other models missed. While competitors like Gemini 3.5 Flash and Claude Opus 4.6 produced less polished or less functional versions, and GPT-5.5 had its own quirks, Qwen3.7-Max stood out for its initial completeness and playability. This performance stems from its design as an "Agent Base Model," built for long-duration, autonomous task execution. Internal tests show it can run continuously for 35 hours, making over 1158 tool calls without context degradation or instruction drift. Key technical advancements include "environment expansion" training, which improves adaptability across different frameworks, and "long-horizon autonomous execution" training, which enables sustained strategic decision-making. By entering the top tier of the programming arena, Qwen3.7-Max demonstrates that Chinese AI models are not just catching up but are becoming defining competitors, challenging the long-standing dominance of Silicon Valley in this field.

marsbit · 05/27 00:17

The Paradox of Automation: The Stronger the AI, the Busier Humans Become

The Paradox of Automation: The more powerful AI becomes, the more work humans have to do. This article, based on observations from AI-heavy company Every, argues that while AI agents automate tasks like coding, writing, and customer service, they don't eliminate human jobs. Instead, they transform work and create *more* demand for human expertise. AI commoditizes "yesterday's human capabilities" by cheaply generating code, text, and images from past data. This leads to an abundance of similar, generic outputs. Consequently, what becomes scarce and valuable is human judgment in the present moment: knowing *what* is worth doing, *why*, and *how* to do it well. The article identifies two collaboration models: "Agent employees" for delegated tasks and "human-AI collaboration" within tools like Claude Code for complex work. In both cases, humans are essential to set direction, judge quality, and maintain systems. As AI makes execution cheap, human roles shift from executors to designers, reviewers, and meaning-makers. The author addresses "benchmark anxiety" by explaining that AI excels within specific, human-defined problem "frames." As AI masters one frame (e.g., code rewriting), new, more complex frames emerge (e.g., deciding *when* to rewrite). This creates an ongoing cycle where AI chases the frames, but humans remain the "framers." Even with advanced AGI, this dynamic may persist as long as AI lacks true human-like agency and self-directed purpose. The core paradox holds: automation amplifies the need for the very human judgment it seems to replace.

marsbit · 05/24 07:06
