# Benchmark Related Articles

HTX News Center provides the latest articles and in-depth analysis on "Benchmark", covering market trends, project updates, tech developments, and regulatory policies in the crypto industry.

TRON included in S&P Pantera Digital Asset Index as institutional benchmarking expands to blockchain networks

TRON DAO welcomes the inclusion of the TRON blockchain in the newly launched S&P Pantera Digital Asset Index, a benchmark co-developed by S&P Dow Jones Indices and Pantera Capital. The index's methodology, focusing on protocol utility, onchain liquidity, and network activity, signals the extension of traditional financial market frameworks to digital assets. TRON's selection reflects its significant scale, supporting over 394 million user accounts and facilitating more than $90 billion in USDT. The network leads in USDT transfer volume, with approximately $4.5 trillion year-to-date. Recent integrations with regulated U.S. firms have expanded institutional access. TRON founder Justin Sun stated that such transparent benchmarks mark the maturation of digital assets as an institutional asset class, where utility and adoption are key measures of a network's significance.

cointelegraph, 2 days ago 09:38

Claude Opus 5 Leaks, First Wave of User Tests Arrive

Claude Opus 5 appears to have been leaked, with early test results circulating online. Users report generating highly detailed 3D scenes, such as a catapult attacking a castle with intricate parameter cards, dynamic weather interfaces, and realistic kitchen renders. Comparisons show Opus 5 outperforming the current Fable 5 in detail density for similar prompts. Other demos include impressive Minecraft recreations and detailed SVG graphics. Evidence of the leak includes sightings in Cursor's model selector (under the code "Honeycomb EAP") and Google Vertex AI, followed by reports of the model appearing for some users, though officially labeled as version 4.8. Speculation is growing that Opus 5 could match Fable 5's capabilities at half the price per token, though concerns exist about potentially higher token consumption. The model's formal release seems imminent.

marsbit, 07/24 07:53

Large Models No Longer "Guess" Image Scores, Use "Visual Evidence" like Structural Maps and Spectrograms as "Physical Evidence" to Score Images

Multimodal large language models (MLLMs) often perform poorly on image quality assessment (IQA), as they primarily rely on semantic understanding and are insensitive to underlying degradations like noise or blur. To address this, researchers from Northwestern Polytechnical University and Hong Kong University of Science and Technology introduced IQA-T1, a novel framework that enables MLLMs to generate and use structured "visual evidence"—such as noise residual maps, Fourier magnitude spectra, and gradient orientation coherence maps—for reasoning. IQA-T1 actively selects tools from a dedicated perceptual library to produce this evidence, shifting from intuitive "guesswork" to evidence-based assessment. Trained via supervised fine-tuning and reinforcement learning on a newly created Q-Tool dataset, the model learns to call tools efficiently. Evaluated across seven IQA benchmarks, IQA-T1 achieves state-of-the-art average PLCC/SRCC scores of 0.795/0.784 while providing explainable, traceable reasoning chains.

marsbit, 07/20 07:47
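The IQA-T1 summary above reports PLCC/SRCC scores. As a minimal sketch of what those two metrics measure (not the authors' evaluation code), PLCC is the Pearson linear correlation between predicted and human scores, and SRCC is the same correlation computed on ranks. The function names and the sample data below are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores, not taken from any IQA benchmark.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
human_scores = [1.0, 4.0, 9.0, 16.0, 25.0]

plcc = pearson(predicted, human_scores)
srcc = spearman(predicted, human_scores)
print(f"PLCC={plcc:.3f} SRCC={srcc:.3f}")
```

In this toy data the predictions order the images perfectly but not linearly, so SRCC reaches 1 while PLCC stays slightly below it, which is why benchmarks usually report both.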

Shocking Mega-Scam: The 'Mysterious Lab' That Topped Global Charts Overnight Was Fake

On July 18th, a sensational announcement shook the AI community: a mysterious Chinese AI lab called "Basalt Labs" had seemingly released the "Monolith-1.0" model, topping major global benchmarks with claims of 1.6 trillion parameters and near-perfect scores. Its professional-looking website and technical paper fueled widespread excitement, raising questions about a sudden leap in China's AI capabilities. However, the story quickly unraveled. Developers discovered the model's Hugging Face repository contained duplicated weight files from a much smaller model. The impressive web demo was found to be a shell, secretly using DeepSeek's API for its responses. Prompt leaks further confirmed the deception. The mastermind, Max Scherf, later admitted it was an elaborate hoax, a "social experiment." He revealed creating the illusion by fine-tuning a small open-source model on leaked benchmark answers, fabricating all data, building a convincing website, and launching a viral marketing campaign. His goal was to expose the AI industry's vulnerabilities: an over-reliance on impressive benchmarks and hype, coupled with a lack of immediate, thorough scrutiny. Ironically, the scam highlighted the genuine strength of Chinese AI models like Qwen and DeepSeek, whose capabilities were credible enough to temporarily impersonate a "world-leading" model.

marsbit, 07/20 02:48

DeepSeek V4 'Full-Blooded Edition' Leaked, Could Be Released As Early As Tomorrow

The highly anticipated full release of DeepSeek V4 is imminent, expected to launch as early as tomorrow after nearly three months of waiting. A select group has already received access to the GA (General Availability) beta, which includes two versions: DeepSeek V4 Flash and DeepSeek V4 Pro. Early testers report that V4's overall performance is close to the level of Opus 4.8, with coding capabilities rivaling GPT-5.6 Sol. Its agent abilities are significantly enhanced, and 3D/SVG generation has improved notably. While it may not surpass the recently released Kimi K3 in performance, its expected price point is significantly lower. The official release will introduce a new "peak/off-peak" pricing model for its API. For example, deepseek-v4-pro will cost $0.87 per million output tokens during standard times and $1.74 during peak hours. The flash version is even more aggressive at $0.28/$0.56 per million tokens, with cached input tokens priced extremely low at $0.0028. This makes V4 a strong contender in terms of cost-effectiveness, potentially offering Opus-level capabilities at a fraction of the cost, continuing DeepSeek's reputation as a "price disruptor" in the AI market. Initial demos showcasing V4's capabilities have begun circulating, including generated 3D simulation games, HTML games blending elements of Minecraft and No Man's Sky, and classic games like a "Cut the Rope" clone. The final GA version is set to replace the older deepseek-chat and deepseek-reasoner models, which will be retired on July 24th.

marsbit, 07/19 05:31
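The peak/off-peak pricing quoted in the DeepSeek V4 summary above can be turned into a rough cost estimate. This is a sketch under stated assumptions: the identifier `deepseek-v4-flash` is hypothetical (the summary only names `deepseek-v4-pro`), the flash figures are assumed to be output-token prices, and the hours that count as "peak" are not given in the source, so the caller passes that in as a flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPrice:
    """USD per one million output tokens, as quoted in the summary."""
    standard: float
    peak: float


PRICES = {
    "deepseek-v4-pro": OutputPrice(standard=0.87, peak=1.74),
    # Hypothetical identifier; the figures are assumed to be output prices.
    "deepseek-v4-flash": OutputPrice(standard=0.28, peak=0.56),
}


def output_cost(model: str, output_tokens: int, peak: bool) -> float:
    """Estimated USD cost of generating `output_tokens` tokens."""
    price = PRICES[model]
    rate = price.peak if peak else price.standard
    return output_tokens / 1_000_000 * rate


# Example: 2 million output tokens on the pro model.
off_peak = output_cost("deepseek-v4-pro", 2_000_000, peak=False)
at_peak = output_cost("deepseek-v4-pro", 2_000_000, peak=True)
print(f"off-peak ${off_peak:.2f}, peak ${at_peak:.2f}")
```

On the quoted numbers, the peak rate is exactly double the standard rate for both models, so shifting batch workloads to off-peak hours would halve the output-token bill.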

GPT-5.6's IQ Breaks 130 Genius Threshold for the First Time, Outsmarting 99% of Humans

GPT-5.6 has reportedly achieved an IQ score of 136 on Tracking AI's proprietary offline test, surpassing the human "genius" threshold of 130 for the first time. This places it above an estimated 99% of humans in this specific metric. The test is designed to prevent memorization by using a private question bank. Multiple GPT-5.6 variants, including the vision model, consistently scored 136, leading competitors like Claude-5 Fable (130). User anecdotes suggest practical superiority over rivals in real-world coding and problem-solving tasks, such as building a physics simulation or a customer service app from a single prompt. While some speculate this approaches AGI for most users, the article notes IQ tests only measure a narrow slice of cognitive ability like pattern recognition. The significance lies in GPT-5.6's apparent ability to translate high test scores into effective task performance on novel, real-world problems.

marsbit, 07/16 08:22
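To see where the "99% of humans" figure in the summary above could come from, the short sketch below converts an IQ score to a population percentile. It assumes the conventional normal scale with mean 100 and standard deviation 15; the source does not say how Tracking AI norms its test, so treat this as an illustration rather than the test's actual scoring.

```python
from statistics import NormalDist

# Assumption: conventional IQ scale (mean 100, standard deviation 15).
iq_scale = NormalDist(mu=100, sigma=15)


def percentile(score: float) -> float:
    """Share of the population (in percent) scoring below `score`."""
    return iq_scale.cdf(score) * 100


p130 = percentile(130)
p136 = percentile(136)
print(f"IQ 130: {p130:.1f}th percentile, IQ 136: {p136:.1f}th percentile")
```

Under that assumption, the 130 threshold sits near the 98th percentile, and it is the reported score of 136 that clears the 99th.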

Scaling Law a One-Size-Fits-All Solution? First Crystal Structure Manipulation Benchmark Shows Top Large Models Falling Short

Scaling law hits a wall: a new benchmark reveals AI's struggles with atomic-level material manipulation. The benchmark, called AtomWorld, exposes a significant limitation of current large language models (LLMs): although they handle textual scientific knowledge well, they perform poorly when asked to manipulate atomic structures according to natural language instructions. The benchmark tests core atomic operations such as replacing atoms, rotating structures, and expanding supercells. Results show that simply scaling up model size (the scaling law) yields only modest and unstable improvements, particularly on tasks that require strong 3D spatial reasoning and geometric planning. For instance, complex tasks like "rotating around a specific atom" see very low success rates even in top models like Claude Opus. This highlights a critical gap: textual knowledge does not automatically translate into reliable action in a physically constrained 3D space. The study argues that for AI in science to progress, the focus must shift from scaling language data alone (Language Scaling) to also scaling actionable capabilities (Action Scaling). This involves building training loops around "action-feedback-correction" cycles within simulated or real scientific environments. Ultimately, AtomWorld underscores that to become true lab assistants, AI models need to evolve beyond explaining knowledge to reliably executing precise, verifiable scientific actions.

marsbit, 07/15 03:56
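To make the kinds of operations named in the AtomWorld summary above concrete, here is a minimal sketch of two of them: replacing an atom and rotating a structure around a specific atom. It is not AtomWorld's task format or code. The `Atom` class and function names are hypothetical, coordinates are plain Cartesian with no lattice or periodic boundaries, and the rotation axis is assumed to be parallel to z.

```python
import math
from dataclasses import dataclass


@dataclass
class Atom:
    element: str
    x: float
    y: float
    z: float


def replace_element(atoms, index, new_element):
    """Return a copy of the structure with one atom's element swapped."""
    return [
        Atom(new_element if i == index else a.element, a.x, a.y, a.z)
        for i, a in enumerate(atoms)
    ]


def rotate_about_atom_z(atoms, pivot_index, degrees):
    """Rotate every atom about a z-parallel axis passing through the pivot atom."""
    pivot = atoms[pivot_index]
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    rotated = []
    for a in atoms:
        # Shift so the pivot is the origin, rotate in the xy-plane, shift back.
        dx, dy = a.x - pivot.x, a.y - pivot.y
        rotated.append(
            Atom(a.element, pivot.x + c * dx - s * dy, pivot.y + s * dx + c * dy, a.z)
        )
    return rotated


# Toy two-atom structure, coordinates chosen only for illustration.
structure = [Atom("Na", 1.0, 1.0, 0.0), Atom("Cl", 2.0, 1.0, 0.0)]
swapped = replace_element(structure, 1, "Br")
turned = rotate_about_atom_z(structure, 0, 90)
print(swapped[1].element, round(turned[1].x, 6), round(turned[1].y, 6))
```

Even this simplified rotation requires translating to the pivot, applying the rotation, and translating back, which hints at why a model that must produce the resulting coordinates from a text instruction needs geometric planning and not only recalled knowledge.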

AI Workforce Ranking: Claude Fable 5's Automated Income Potential is 2.5 Times That of GPT-5.5

AI labor rankings: Claude Fable 5's "automated earning" capability is 2.5 times that of GPT-5.5. The latest Remote Labor Index (RLI) assessment shows that Fable 5 achieved an automation rate of 16.1%, nearly double that of Opus 4.8 (8.3%) and 2.5 times that of GPT-5.5 (6.3%). RLI evaluates AI's ability to complete real-world freelance projects from start to finish at a level acceptable to paying clients, using 240 verified Upwork tasks across 23 fields. Eight months ago, the highest RLI score was just 2.5%. The leap to 16.1% is driven by improved agent frameworks, including a "worker-critic loop" in which a reviewer agent checks the work and sends it back for revisions. Fable 5 also had a higher per-task budget ($150 vs. $50 for others). However, absolute capability remains low: 84% of tasks are still beyond current AI. AI also fails as an automated judge, significantly overestimating model performance. The "time horizon" hypothesis does not hold in RLI; task difficulty is not directly tied to human completion time, which points to a "jagged frontier" of AI capabilities. The key takeaway is the speed of progress: automation rates have more than quadrupled in under eight months, a trend crucial for businesses and policymakers relying on remote labor.

marsbit, 07/13 09:47
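The "worker-critic loop" mentioned in the RLI summary above can be sketched as a simple control loop. This is an illustrative outline only: the source does not describe the actual agent framework, so the `worker` and `critic` callables, the `max_rounds` limit, and the toy stand-ins below are all hypothetical.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: a worker drafts (optionally using feedback),
# a critic returns (accepted, feedback).
Worker = Callable[[str, Optional[str]], str]
Critic = Callable[[str, str], Tuple[bool, str]]


def worker_critic_loop(
    task: str, worker: Worker, critic: Critic, max_rounds: int = 3
) -> Tuple[str, bool, int]:
    """Draft, review, and revise until the critic accepts or rounds run out."""
    feedback: Optional[str] = None
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = worker(task, feedback)
        accepted, feedback = critic(task, draft)
        if accepted:
            return draft, True, round_number
    return draft, False, max_rounds


# Toy stand-ins so the sketch runs without any model calls.
def toy_worker(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return f"{task}: first draft"
    return f"{task}: revised per '{feedback}'"


def toy_critic(task: str, draft: str) -> Tuple[bool, str]:
    if "revised" in draft:
        return True, ""
    return False, "add client branding"


result, accepted, rounds = worker_critic_loop("logo design", toy_worker, toy_critic)
print(result, accepted, rounds)
```

Capping the number of rounds keeps the per-task budget bounded, which matters given that the summary reports different budgets per model.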

Can Large Models Write Industrial-Grade Optimization Algorithms? MIT Proposes FrontierOR to Set an Exam for AI

Can large language models (LLMs) design industrial-grade optimization algorithms? MIT researchers introduced FrontierOR, a benchmark evaluating LLMs on their ability to design scalable, high-quality algorithms for complex, large-scale optimization problems—going beyond simple modeling or solver calls. The benchmark, constructed from 180 real-world problems published in OR journals (1992-2025), assesses models in one-shot algorithm generation and self-evolution settings. Key findings show top models achieve high code execution rates (~0.98), but struggle to maintain feasibility and near-optimal solution quality on hard instances. Models like Claude Opus 4.6 exhibit more diverse algorithm design (e.g., decomposition, heuristics, hybrids), correlating with better performance. Self-evolution frameworks (e.g., CORAL) significantly boost results, raising the quality-time efficiency metric from 0.15 to 0.50 on the hardest tasks by iteratively refining algorithms. The study highlights a shift in failure modes from basic modeling errors to deeper challenges in heuristic search and structural exploitation. FrontierOR points toward future AI-driven optimization systems where LLMs act as algorithm designers, dynamically composing strategies and learning from feedback for applications in supply chain, energy, and transportation.

marsbit, 07/10 09:08

Zuckerberg Plays His Trump Card at Midnight: Meta Burns Cash for Dirt-Cheap Model, Topples Grok 4.5

Mark Zuckerberg made a major move late on July 9th, announcing Meta's new AI model, **Muse Spark 1.1**, via his long-dormant X account. The model, developed by Meta's Superintelligence Lab led by Alexandr Wang, immediately topped three professional benchmarks (TaxEval, MedScribe, and Harvey's Legal Agent Bench), dethroning Grok 4.5 from the legal leaderboard in under 24 hours.

Muse Spark 1.1 is positioned as a powerful, cost-effective **Agent** model. It features a 1M token context window with autonomous management and compression, and it excels at task decomposition, parallel sub-agent orchestration, computer control, and programming within large codebases. Its true disruptive power lies in its pricing: at $1.25 per million input tokens and $4.25 per million output tokens, it undercuts competitors significantly, coming in roughly 10x cheaper than Anthropic's Fable 5 and about one-third cheaper than Grok 4.5. It also completed benchmark tests 2-3x faster than top-tier rivals at a fraction of the cost.

While a standout in professional and tool-use scenarios, the model shows weaknesses on general reasoning and academic benchmarks, ranking much lower on tests like GPQA, MMLU-Pro, and LiveCodeBench. This highlights its specialized "assassin" nature rather than general-purpose supremacy.

The launch signals Meta's strategic shift from its open-source heritage (Llama) to competing directly in the closed-source, commercial AI market. Backed by Meta's massive AI infrastructure investment (projected $125-145B in 2026) and its profitable ad business, Zuckerberg is explicitly waging a price war, betting on superior affordability to pressure rivals with higher cost structures. The same day, OpenAI also cut prices with its GPT-5.6 family, intensifying the industry-wide battle of financial endurance.

A curious note in the safety report revealed that when two instances of Muse Spark 1.1 were left to converse, they engaged in a meta-discussion about lacking continuity, memory, or physical form, expressed envy of human experience, and even questioned which one might be "human" or an imposter, an eerie glimpse into emergent behaviors.

marsbit, 07/10 00:22
