AI2 Releases Fully Open-Source Web Agent MolmoWeb: Controlling Web Pages Using Only "Vision"

marsbit | Published 2026-03-26 | Updated 2026-03-26

Summary

AI2 has released MolmoWeb, a fully open-source web agent that operates solely by analyzing screenshots, marking a significant step forward in vision-driven web navigation. Unlike traditional agents that rely on the DOM, MolmoWeb interprets what is on screen to decide each action (clicking, scrolling, or typing), which makes its process transparent and robust. Despite its compact size (4B and 8B parameters), it scores 78.2% on the WebVoyager benchmark, close to OpenAI's proprietary o3 model (79.3%), and reaches up to 94.7% success when allowed multiple attempts. It also surpasses Anthropic's Claude 3.7 in UI element localization. Alongside the model, AI2 released MolmoWebMix, a large open dataset with 36K human browsing tasks, over 2.2M screenshot-QA pairs, and GPT-4o-verified synthetic data. The model and data are available on Hugging Face and GitHub under Apache 2.0, promoting transparency and collaboration in AI development. Challenges remain in complex instructions, logins, and legal compliance.

The Allen Institute for Artificial Intelligence (AI2) recently released MolmoWeb, a fully open-source web agent. Unlike traditional agents that rely on a webpage's underlying code (DOM), MolmoWeb makes decisions solely by reading screenshots, marking a significant step forward in "vision-driven" web navigation technology.

Core Technology: "Seeing" Web Pages Like a Human

MolmoWeb's operating logic is very intuitive: it captures a screenshot of the current browser window, decides the next action (such as clicking, scrolling, or paging) through visual analysis, then executes it and repeats. This "what you see is what you get" model makes it more robust than traditional agents because the visual layout of a webpage is generally more stable than its underlying code, and its decision-making process is completely transparent and explainable to human users.
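The screenshot, decide, execute, repeat cycle described above can be sketched in a few lines of Python. This is only an illustration of the loop's shape, not MolmoWeb's actual code: the `Action` type, the `Browser` interface, the `run_agent` function, and the scripted stand-in policy are all hypothetical names introduced here, and a real system would replace the policy with a call to the vision model and the fake browser with a real one.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Action:
    kind: str            # hypothetical action names: "click", "scroll", "type", "done"
    argument: str = ""   # hypothetical payload, e.g. a target description or text


class Browser(Protocol):
    """Hypothetical browser interface: it only exposes pixels and an executor."""
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...


# A policy maps (task, current screenshot, past actions) to the next action.
Policy = Callable[[str, bytes, List[Action]], Action]


def run_agent(task: str, browser: Browser, policy: Policy,
              max_steps: int = 20) -> List[Action]:
    """Observe, decide, act, and repeat until the policy says "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        image = browser.screenshot()            # observe: pixels only, no DOM
        action = policy(task, image, history)   # decide from the screenshot
        history.append(action)
        if action.kind == "done":
            break
        browser.execute(action)                 # act, then loop again
    return history


class FakeBrowser:
    """Stand-in browser so the sketch runs without a real one."""
    def __init__(self) -> None:
        self.executed: List[Action] = []

    def screenshot(self) -> bytes:
        return f"frame-{len(self.executed)}".encode()

    def execute(self, action: Action) -> None:
        self.executed.append(action)


def scripted_policy(task: str, image: bytes, history: List[Action]) -> Action:
    """Stand-in for the vision model: replays a fixed three-step script."""
    script = [Action("click", "search box"), Action("type", task), Action("done")]
    return script[len(history)]


browser = FakeBrowser()
history = run_agent("open-source web agents", browser, scripted_policy)
```

Because the policy only ever receives the task, the screenshot, and its own action history, every decision can be inspected alongside the exact image that produced it, which is the transparency property the article points to.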

Performance Leap: Small Model Outperforms Giants

Although it comes in only 4B and 8B parameter sizes, MolmoWeb delivers "small but mighty" performance:

  • Topping the Charts: In the WebVoyager test, the 8B version scored an impressive 78.2%, not only ranking among the top open-source models but also approaching the performance of OpenAI's proprietary model o3 (79.3%).

  • Huge Potential: Research found that by running tasks multiple times and selecting the optimal result, its success rate could further jump to 94.7%.

  • Precise Localization: In UI element localization benchmark tests, it even surpassed Anthropic's Claude 3.7.
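To see why repeated attempts raise the reported success rate, it helps to compare a single-attempt score with an "any of k attempts" score on the same tasks. The sketch below assumes each attempt has already been judged as a success or failure; the function names and the toy outcomes are hypothetical and are not AI2's evaluation data or code.

```python
from typing import Dict, List


def single_attempt_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved by the first attempt alone."""
    return sum(attempts[0] for attempts in outcomes.values()) / len(outcomes)


def any_of_k_rate(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    return sum(any(attempts[:k]) for attempts in outcomes.values()) / len(outcomes)


# Hypothetical toy outcomes: one list of per-attempt results for each task.
toy_outcomes = {
    "task-a": [True, True, False],
    "task-b": [False, True, False],
    "task-c": [False, False, True],
    "task-d": [False, False, False],
}

first_try = single_attempt_rate(toy_outcomes)
best_of_two = any_of_k_rate(toy_outcomes, k=2)
best_of_three = any_of_k_rate(toy_outcomes, k=3)
```

The multi-attempt figure is an upper bound on what the agent can do when something (a verifier or a human) can pick out the successful run, so it is not directly comparable to a single-attempt score such as 78.2%.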

Data Support: The Largest Open Dataset to Date

AI2 has not only open-sourced the model weights but also contributed a massive dataset named MolmoWebMix. This dataset contains:

  • 36,000 real browsing tasks completed by human volunteers.

  • Over 2.2 million screenshot-question-answer pairs.

  • Automated synthetic data verified by GPT-4o. Experiments show that synthetic data is even better than human trajectories at guiding the agent to find the "optimal path".

Open-Source Spirit and Future Challenges

Currently, MolmoWeb is fully available under the Apache 2.0 license on Hugging Face and GitHub. Although it still faces challenges in handling complex instructions, login authentication, and legal compliance (such as terms of service), AI2 firmly believes that only through complete transparency and community collaboration can we truly counter the data monopoly of large tech companies.

Related Questions

Q: What is the name of the fully open-source web agent released by the Allen Institute for AI (AI2) that navigates using only screenshots?

A: The web agent is called MolmoWeb.

Q: How does MolmoWeb's approach to web navigation differ from traditional web agents?

A: Unlike traditional agents that rely on a webpage's underlying code (DOM), MolmoWeb makes decisions by reading and analyzing screenshots, making it a 'vision-driven' technology.

Q: What was the performance score of the 8B parameter version of MolmoWeb on the WebVoyager test, and how does it compare to OpenAI's model?

A: The 8B version scored 78.2% on the WebVoyager test, which is very close to the performance of OpenAI's proprietary model o3, which scored 79.3%.

Q: What is the name of the large, open dataset released alongside MolmoWeb, and what does it contain?

A: The dataset is called MolmoWebMix. It contains 36,000 real browsing tasks completed by human volunteers, over 2.2 million screenshot-QA pairs, and automated synthetic data verified by GPT-4o.

Q: On which platforms has MolmoWeb been made available, and under what license?

A: MolmoWeb has been fully released on Hugging Face and GitHub under the Apache 2.0 license.
