AI2 Releases Fully Open-Source Web Agent MolmoWeb: Controlling Web Pages Using Only "Vision"
AI2 has released MolmoWeb, a groundbreaking, fully open-source web agent that operates solely by analyzing screenshots, marking a significant leap in vision-driven web navigation. Unlike traditional agents that rely on DOM, MolmoWeb captures and interprets visual data to make decisions—such as clicking, scrolling, or typing—making its process transparent and robust.
Despite its compact size (4B and 8B parameters), MolmoWeb performs impressively: it scores 78.2% on the WebVoyager benchmark, nearing OpenAI’s proprietary o3 model (79.3%), and achieves up to 94.7% success with multiple attempts. It even surpasses Anthropic’s Claude3.7 in UI element localization.
AI2 also released MolmoWebMix, a massive open dataset with 36K human-browsing tasks, over 2.2M screenshot-QA pairs, and GPT-4o-verified synthetic data. The model and data are fully available on Hugging Face and GitHub under Apache 2.0, promoting transparency and collaboration in AI development. Challenges remain in complex instructions, logins, and legal compliance.
marsbit03/26 01:39