AI2 Releases Fully Open-Source Web Agent MolmoWeb: Controlling Web Pages Using Only "Vision"

marsbit | Published 2026-03-26 | Last updated 2026-03-26

Summary

AI2 has released MolmoWeb, a fully open-source web agent that operates solely by analyzing screenshots, a significant step for vision-driven web navigation. Unlike traditional agents that rely on the DOM, MolmoWeb captures and interprets visual data to decide on actions such as clicking, scrolling, or typing, which makes its process transparent and robust. Despite its compact size (4B and 8B parameters), MolmoWeb performs well: it scores 78.2% on the WebVoyager benchmark, close to OpenAI's proprietary o3 model (79.3%), and reaches up to 94.7% success with multiple attempts. It also surpasses Anthropic's Claude 3.7 in UI element localization. AI2 also released MolmoWebMix, a large open dataset with 36K human browsing tasks, over 2.2M screenshot-QA pairs, and GPT-4o-verified synthetic data. The model and data are available on Hugging Face and GitHub under Apache 2.0, promoting transparency and collaboration in AI development. Challenges remain with complex instructions, logins, and legal compliance.

The Allen Institute for Artificial Intelligence (AI2) recently released MolmoWeb, a fully open-source web agent. Unlike traditional agents that rely on a webpage's underlying code (the DOM), MolmoWeb makes decisions solely by reading screenshots, marking a significant step forward in "vision-driven" web navigation.

Core Technology: "Seeing" Web Pages Like a Human

MolmoWeb's operating logic is very intuitive: it captures a screenshot of the current browser window, decides the next action (such as clicking, scrolling, or paging) through visual analysis, then executes it and repeats. This "what you see is what you get" model makes it more robust than traditional agents because the visual layout of a webpage is generally more stable than its underlying code, and its decision-making process is completely transparent and explainable to human users.
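The screenshot, decide, execute, repeat cycle described above can be sketched in a few lines. This is a minimal illustration, not MolmoWeb's actual code: `Action`, `FakeBrowser`, `run_agent`, and `scripted_policy` are hypothetical names, the action schema is assumed, and a scripted stub stands in for the vision model so the example runs without a browser or model weights.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the agent (hypothetical schema, not MolmoWeb's)."""
    kind: str                     # e.g. "click", "scroll", "type", "stop"
    x: Optional[int] = None       # pixel coordinates, if the action needs them
    y: Optional[int] = None
    text: Optional[str] = None    # text to enter for a "type" action


@dataclass
class FakeBrowser:
    """Stand-in for a real browser so the sketch has no dependencies."""
    log: List[Action] = field(default_factory=list)

    def screenshot(self) -> bytes:
        # A real browser would return image bytes of the current viewport.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        # A real browser would perform the click/scroll/keystrokes here.
        self.log.append(action)


Policy = Callable[[str, bytes, List[Action]], Action]


def run_agent(task: str, browser: FakeBrowser, policy: Policy,
              max_steps: int = 10) -> List[Action]:
    """Loop: capture a screenshot, pick an action from pixels only, execute."""
    history: List[Action] = []
    for _ in range(max_steps):
        image = browser.screenshot()            # the only page input is pixels
        action = policy(task, image, history)   # assumed model call
        history.append(action)
        if action.kind == "stop":               # the policy signals completion
            break
        browser.execute(action)
    return history


def scripted_policy(task: str, image: bytes, history: List[Action]) -> Action:
    """Placeholder for a vision model: replays a fixed script of actions."""
    script = [
        Action("click", x=120, y=48),
        Action("scroll", y=400),
        Action("stop"),
    ]
    return script[min(len(history), len(script) - 1)]


if __name__ == "__main__":
    trace = run_agent("find the pricing page", FakeBrowser(), scripted_policy)
    for step in trace:
        print(step)
```

Because the policy receives nothing but the task text, the screenshot, and its own action history, the returned trace doubles as a human-readable record of what the agent did at each step.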

Performance Leap: Small Model Outperforms Giants

Despite coming in only 4B and 8B parameter sizes, MolmoWeb delivers "small but mighty" performance:

  • Topping the Charts: In the WebVoyager test, the 8B version scored an impressive 78.2%, not only ranking among the top open-source models but also approaching the performance of OpenAI's proprietary model o3 (79.3%).

  • Huge Potential: Research found that by running tasks multiple times and selecting the optimal result, its success rate could further jump to 94.7%.

  • Precise Localization: On UI element localization benchmarks, it even surpassed Anthropic's Claude 3.7.
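For intuition on why repeated attempts raise the success rate, here is a back-of-the-envelope sketch. It assumes every attempt succeeds independently with the same probability, which is an idealization and not something the article claims; the article also does not say how many attempts were behind the 94.7% figure. The function name `success_within_k` is made up for this example.

```python
def success_within_k(p: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumes independent attempts with identical success probability p
    (an idealization, not a measured property of MolmoWeb).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    # All k attempts fail with probability (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    single_try = 0.782  # the 8B model's reported WebVoyager score
    for k in (1, 2, 3, 4):
        print(k, round(success_within_k(single_try, k), 3))
```

In practice, repeated attempts at the same task tend to fail for the same reasons, so this formula gives an optimistic picture and should not be read as a prediction of the reported multi-attempt result.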

Data Support: The Largest Open Dataset to Date

AI2 has not only open-sourced the model weights but also contributed a massive dataset named MolmoWebMix. This dataset contains:

  • 36,000 real browsing tasks completed by human volunteers.

  • Over 2.2 million screenshot-question-answer pairs.

  • Automated synthetic data verified by GPT-4o. Experiments show that synthetic data is even better than human trajectories at guiding the agent to find the "optimal path".

Open-Source Spirit and Future Challenges

Currently, MolmoWeb is fully available under the Apache 2.0 license on Hugging Face and GitHub. Although it still faces challenges in handling complex instructions, login authentication, and legal compliance (such as terms of service), AI2 firmly believes that only through complete transparency and community collaboration can we truly counter the data monopoly of large tech companies.

Related Questions

Q: What is the name of the fully open-source web agent released by the Allen Institute for AI (AI2) that navigates using only screenshots?

A: The web agent is called MolmoWeb.

Q: How does MolmoWeb's approach to web navigation differ from traditional web agents?

A: Unlike traditional agents that rely on a webpage's underlying code (DOM), MolmoWeb makes decisions by reading and analyzing screenshots, making it a "vision-driven" technology.

Q: What was the performance score of the 8B parameter version of MolmoWeb on the WebVoyager test, and how does it compare to OpenAI's model?

A: The 8B version scored 78.2% on the WebVoyager test, which is very close to the performance of OpenAI's proprietary model o3, which scored 79.3%.

Q: What is the name of the large, open dataset released alongside MolmoWeb, and what does it contain?

A: The dataset is called MolmoWebMix. It contains 36,000 real browsing tasks completed by human volunteers, over 2.2 million screenshot-QA pairs, and automated synthetic data verified by GPT-4o.

Q: On which platforms has MolmoWeb been made available, and under what license?

A: MolmoWeb has been fully released on Hugging Face and GitHub under the Apache 2.0 license.
