# Related Articles on Benchmark

The HTX News Center offers the latest articles and in-depth analysis on "Benchmark", covering market trends, project news, technology developments, and regulatory policy in the crypto industry.

Embodied Intelligence Breakthrough: Amap Fully Open-Sources Universal Robot Base Model ABot-M0

AutoNavi (Amap) has announced the full open-source release of ABot-M0, which it describes as the world's first embodied manipulation base model built on a unified architecture. The model is designed to let "one general brain adapt to multiple forms of robots," with the aim of breaking down barriers between heterogeneous hardware and accelerating the adoption of embodied intelligence in industrial and household settings.

ABot-M0 achieved a task success rate of 80.5% on the Libero-Plus benchmark, a nearly 30% improvement over the previous baseline, Pi0. It also set new state-of-the-art results on benchmarks such as Libero and RoboCasa.

The open-source release addresses long-standing challenges in the field, such as data isolation and deployment difficulties, by providing resources across three key dimensions:

- **Data:** The UniACT dataset, the largest of its kind, with over 6 million real operation trajectories and full data pipeline tools.
- **Algorithm:** The model architecture and training framework, featuring components such as Action Manifold Learning (AML) and a dual-stream perception architecture.
- **Model:** End-to-end pre-trained models and a complete toolchain for out-of-the-box deployment, significantly lowering the barrier to adaptation.

According to AutoNavi's ABot-M0 technical lead, the open-source initiative aims to build a bridge between academic research and industrial application, enabling robots of various forms to share a smart, reliable, and universal "brain."

marsbit 04/01 08:19


AI2 Releases Fully Open-Source Web Agent MolmoWeb: Controlling Web Pages Using Only "Vision"

AI2 has released MolmoWeb, a fully open-source web agent that operates solely by analyzing screenshots. Unlike traditional agents that rely on the DOM, MolmoWeb interprets visual data to decide on actions such as clicking, scrolling, or typing, which makes its decision process transparent and robust. Despite its compact size (4B and 8B parameters), it scores 78.2% on the WebVoyager benchmark, close to OpenAI's proprietary o3 model (79.3%), and reaches up to 94.7% success when allowed multiple attempts. It also surpasses Anthropic's Claude 3.7 in UI element localization. AI2 also released MolmoWebMix, a large open dataset with 36K human browsing tasks, over 2.2M screenshot-QA pairs, and GPT-4o-verified synthetic data. The model and data are available on Hugging Face and GitHub under Apache 2.0, promoting transparency and collaboration in AI development. Challenges remain with complex instructions, logins, and legal compliance.

marsbit 03/26 01:39


Xiaomi and MiniMax Unleash Major Upgrades Simultaneously, Officially Kicking Off the Agent Pricing War

Chinese AI company MiniMax and Xiaomi's MiMo team launched major agent-focused models, M2.7 and V2-Pro respectively, within two days of each other in March. Both models rank in the top tier globally on agent benchmarks but are priced far below leading Western models: MiniMax at $1.2 per million tokens (1/21 of Claude Opus) and MiMo at $3 (1/8 of Claude Opus).

The two represent divergent technical strategies. MiMo-V2-Pro takes a scale-driven approach, with over 1 trillion parameters and a hybrid attention mechanism optimized for long-context and multi-tool agent tasks. MiniMax's M2.7 instead uses a self-iterative optimization method, autonomously refining its architecture over 100+ cycles to improve performance; MiniMax has not disclosed its parameter count.

Their release rhythms also differ: MiniMax iterates rapidly, with four versions in five months, while Xiaomi ships fewer but more substantial upgrades. Notably, Xiaomi debuted V2-Pro anonymously on OpenRouter as "Hunter Alpha," topping the platform's usage chart before revealing its identity, a first for a Chinese AI model gaining global developer traction on performance alone.

marsbit 03/20 08:03


ChangeNOW Is Settling Crypto Swaps in Under a Minute.

Based on Swapzone's 2026 speed benchmarks, ChangeNOW has established a dominant lead in non-custodial crypto swap speeds. While the industry median for a USDT-to-ETH swap is 45 minutes, ChangeNOW completes the same transaction in under 60 seconds—a 45x difference. This speed is critical as it minimizes the risk of price movements during settlement, ensuring users get the rate they see. The company attributes its performance to infrastructure-level optimizations in liquidity routing, aiming to make near-instant settlement a new industry standard for user trust.

bitcoinist 03/06 21:32


Founders Fund, Pantera, and Franklin Templeton Join Sentient's 'Arena' to Stress-Test Enterprise-Grade AI Agents

Sentient Labs has launched Arena, a real-time, production-ready environment designed to stress-test and iteratively improve enterprise AI agents through competitive challenges. The platform addresses the growing need for reliable, explainable, and reproducible reasoning in high-stakes business workflows such as finance, compliance, and customer operations. Initial participants include Founders Fund, Pantera, and Franklin Templeton, which manages over $1.5 trillion in assets. Arena simulates complex, messy real-world scenarios—incomplete information, long contexts, ambiguous instructions, and conflicting sources—to evaluate not just correctness but full reasoning traces. This allows engineering teams to diagnose failures and track improvements over time. The first challenge focuses on document reasoning, a foundational task for areas like financial analysis and investigative research. Other participants include alphaXiv, Fireworks, OpenHands, and OpenRouter. The initiative comes as 85% of enterprises aim to become "agentic enterprises," but few have mature governance frameworks. Arena provides a vendor-agnostic benchmark to help transition AI agents from demos to production-scale reliability.

marsbit 02/27 13:28


AI Models Are Evolving Rapidly, How Can Workers Overcome 'AI Anxiety'?

AI models and tools are evolving rapidly, creating a sense of anxiety among professionals who feel pressured to keep up. The root of this "AI anxiety" isn't the pace of change itself, but the lack of a filter to distinguish what truly matters for one's work. Three key forces drive this anxiety: the AI content ecosystem thrives on urgency and hype, loss aversion makes people fear missing out, and too many options lead to decision paralysis.

The solution is not to consume more information, but to build a personalized filtering system. "Keeping up" doesn't mean testing every new tool on day one; it means having a system to automatically answer: "Is this important for *my* work?" Three practical strategies are proposed:

1. **Build a "Weekly AI Digest" Agent:** Use automation (e.g., n8n) to gather news from trusted sources, then use an AI to filter it based on your specific job role and tasks. This delivers a concise weekly report of only the relevant updates.
2. **Test with *Your* Prompts:** When a new tool seems relevant, test it using your actual work prompts, not the vendor's perfect demos. Compare the results side-by-side with your current tools to see if it's truly better for your workflow.
3. **Distinguish "Benchmark" vs. "Business" Releases:** Most announcements are "benchmark releases" (improvements on standardized tests) that have little real-world impact. Focus only on "business releases" that offer new capabilities you can use immediately.

Combining these strategies transforms AI updates from a source of stress into a manageable advantage. The real competitive edge lies not in accessing every new model, but in knowing what to ignore and what to test deeply for your specific work. The key is to stop trying to follow everything and start filtering for what truly matters.
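The filtering step behind the "Weekly AI Digest" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain keyword matching stands in for the AI-based filter the article describes, and the names `Item`, `relevance`, and `weekly_digest` are hypothetical, not part of n8n or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One news item gathered from a trusted source (hypothetical shape)."""
    title: str
    summary: str


def relevance(item: Item, role_keywords: set[str]) -> int:
    """Count how many role keywords appear in the item's title or summary."""
    text = f"{item.title} {item.summary}".lower()
    return sum(1 for kw in role_keywords if kw.lower() in text)


def weekly_digest(items, role_keywords, min_hits=1, limit=5):
    """Keep items that match at least `min_hits` keywords, best matches first."""
    scored = [(relevance(it, role_keywords), it) for it in items]
    kept = [(score, it) for score, it in scored if score >= min_hits]
    # Sort on the score only, so Item objects are never compared to each other.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in kept[:limit]]


if __name__ == "__main__":
    news = [
        Item("New SOTA on coding benchmark", "Model X tops a leaderboard."),
        Item("Spreadsheet assistant adds SQL export",
             "Analysts can query a data warehouse from a sheet."),
    ]
    # Keywords describing one person's actual tasks (assumed for the example).
    for item in weekly_digest(news, {"sql", "data warehouse", "dashboard"}):
        print(item.title)
```

In practice the `relevance` function is the part an AI model would replace; the surrounding structure (score, threshold, cap the list) stays the same.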

marsbit 02/09 12:19


Just 6 Days After Launching ChatGPT Health, OpenAI Is Surpassed on Its Own Medical Benchmark

In a significant development in the AI healthcare sector, Baichuan Intelligence has surpassed OpenAI's GPT-5.2 High on the HealthBench benchmark—a medical evaluation dataset created by OpenAI with input from 260+ doctors across 60 countries—just six days after OpenAI launched ChatGPT Health. Baichuan's new model, Baichuan-M3, achieved a top score of 65.1 and also led in the more challenging HealthBench Hard subset, while demonstrating the lowest hallucination rate (3.5%) without relying on external tools. Key to M3’s performance is its Fact Aware RL technique, which improves diagnostic accuracy by balancing factual precision with proactive questioning. The model avoids both over-confident errors and overly vague responses. Additionally, Baichuan introduced SCAN-bench, a new evaluation framework designed to simulate real doctor-patient interactions. In tests, M3 outperformed human specialists in areas like safety stratification, clarity, and diagnostic questioning, partly due to its ability to integrate knowledge across medical disciplines. Baichuan is now rolling out the model via its consumer product Baixiaoying (百小应), offering tailored interfaces for both doctors and patients. The company emphasizes a focus on "serious medicine," prioritizing complex areas like oncology over general wellness, aiming to augment—not just assist—medical professionals. According to CEO Wang Xiaochuan, enhancing AI’s capability in high-stakes medical scenarios is crucial for building user trust and advancing toward AGI through deeper biological understanding.

marsbit 01/14 02:31


Crypto Gets A Wall Street Upgrade As Nasdaq And CME Deepen Ties

Nasdaq and CME Group have relaunched the Nasdaq Crypto Index as the Nasdaq-CME Crypto Index (NCI), a regulated benchmark designed to support crypto-based investment products like ETFs. Jointly overseen by both exchanges and calculated by CF Benchmarks, the NCI aims to provide institutional investors with a transparent, rules-based measure of the cryptocurrency market. It tracks a diversified basket of major coins rather than a single asset, reflecting traditional index management practices. The initiative is part of a broader collaboration between Nasdaq’s indexing expertise and CME’s trading platform, with a phased rollout beginning in late 2025 and continuing into January 2026.

bitcoinist 01/11 15:08


Nasdaq and CME Rebrand Joint Crypto Index to Expand Exposure

Nasdaq and CME Group have rebranded their joint crypto index to the Nasdaq-CME Crypto Index, deepening their long-standing collaboration. The updated index serves as a benchmark for investors seeking diversified exposure beyond single-asset crypto strategies, reflecting growing regulatory clarity and institutional participation. It tracks multiple cryptocurrencies, including Bitcoin, Ether, XRP, Solana, Chainlink, Cardano, and Avalanche, representing a broader market view rather than a Bitcoin-centric approach. The index emphasizes governance, transparency, and institutional risk compliance, with eligibility rules, liquidity thresholds, and quarterly rebalancing. This initiative builds on the decades-long partnership between Nasdaq and CME, which began with Nasdaq-100 futures in the 1990s.

TheNewsCrypto 01/10 12:41


Vaults, Yields, and the Illusion of Safety: The Real-World Benchmark

Vaults in crypto have evolved beyond simple yield farming tools into programmable portfolio wrappers, yet they remain dangerously misunderstood. While often marketed as yield-generating products, their economic essence is risk. This article reframes vaults as API-wrapped portfolios that embed various strategies (lending, leverage, credit underwriting) while obscuring the actual risks users assume: smart contract risk, counterparty risk, basis risk, and more. Historical data from traditional finance reveals a persistent risk-return ladder: cash/T-bills (~3.3%), investment-grade bonds (~4.5%), high-yield bonds (~7.8%), equities (~9.9%), and private equity/VC (12%+). These returns compensate for specific risks such as credit default, illiquidity, volatility, and complexity. DeFi's focus on APY alone ignores this reality and encourages riskier strategies to compete for yield, leading to systemic fragility (e.g., the Stream and Elixir collapses). The solution is to adopt a risk-aware framework: categorize vaults by their risk profile (cash-like, credit, equity, etc.) and transparently communicate the sources of yield and the associated risks. This shift is critical for both user protection and institutional adoption.
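One way to read the risk-return ladder above is as a sanity check on an advertised APY: a yield in a given range suggests the vault is taking at least the kind of risk that rung compensates for. The Python sketch below is a rough heuristic built on that assumption, not the article's own method; the return figures are the approximate values quoted above, and the function name `implied_risk_tier` is hypothetical.

```python
# Approximate long-run annual returns (percent) quoted in the summary above.
RISK_LADDER = [
    ("cash/T-bills", 3.3),
    ("investment-grade bonds", 4.5),
    ("high-yield bonds", 7.8),
    ("equities", 9.9),
    ("private equity/VC", 12.0),
]


def implied_risk_tier(apy_percent: float) -> str:
    """Return the highest rung whose typical return the APY meets or exceeds.

    Illustrative heuristic only: it says nothing about the vault's actual
    strategy, just which traditional risk bucket pays a comparable yield.
    """
    tier = "below cash-like returns"
    for name, typical_return in RISK_LADDER:
        if apy_percent >= typical_return:
            tier = name
    return tier


if __name__ == "__main__":
    for apy in (2.0, 4.0, 8.5, 15.0):
        print(f"{apy:>5.1f}% APY -> comparable to {implied_risk_tier(apy)}")
```

A vault advertising 8.5% while describing itself as cash-like would, under this heuristic, land on the high-yield rung, which is the kind of mismatch a risk-aware label is meant to expose.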

深潮 12/19 02:59
