AI2 Releases Fully Open-Source Web Agent MolmoWeb: Controlling Web Pages Using Only "Vision"

marsbit · Published on 2026-03-26 · Last updated on 2026-03-26

Abstract

AI2 has released MolmoWeb, a groundbreaking, fully open-source web agent that operates solely by analyzing screenshots, marking a significant leap in vision-driven web navigation. Unlike traditional agents that rely on the DOM, MolmoWeb captures and interprets visual data to decide on actions such as clicking, scrolling, or typing, making its process transparent and robust. Despite its compact size (4B and 8B parameters), MolmoWeb performs impressively: it scores 78.2% on the WebVoyager benchmark, nearing OpenAI's proprietary o3 model (79.3%), and reaches up to 94.7% success when given multiple attempts. It even surpasses Anthropic's Claude 3.7 in UI element localization. AI2 also released MolmoWebMix, a massive open dataset with 36K human browsing tasks, over 2.2M screenshot-QA pairs, and GPT-4o-verified synthetic data. The model and data are fully available on Hugging Face and GitHub under Apache 2.0, promoting transparency and collaboration in AI development. Challenges remain around complex instructions, logins, and legal compliance.

The Allen Institute for Artificial Intelligence (AI2) recently released MolmoWeb, a groundbreaking, fully open-source web agent. Unlike traditional agents that rely on a webpage's underlying code (the DOM), MolmoWeb makes decisions solely by reading screenshots, marking a significant leap forward in "vision-driven" web navigation technology.

Core Technology: "Seeing" Web Pages Like a Human

MolmoWeb's operating logic is very intuitive: it captures a screenshot of the current browser window, decides the next action (such as clicking, scrolling, or typing) through visual analysis, then executes it and repeats. This "what you see is what you get" approach makes it more robust than traditional agents, because the visual layout of a webpage is generally more stable than its underlying code, and its decision-making process is completely transparent and explainable to human users.
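To make the loop concrete, here is a minimal sketch of a screenshot-driven browsing cycle using Playwright for browser control. The `decide_next_action` call and the action vocabulary are illustrative assumptions, not AI2's actual interface.

```python
# Minimal sketch of a vision-only browsing loop (illustrative, not AI2's code).
# decide_next_action() stands in for a hypothetical call to a vision-language model.
from playwright.sync_api import sync_playwright

def decide_next_action(screenshot_png: bytes, task: str) -> dict:
    """Hypothetical model call. Returns an action such as
    {"type": "click", "x": 412, "y": 188} or {"type": "stop", "answer": "..."}."""
    raise NotImplementedError

def run_task(task: str, start_url: str, max_steps: int = 20) -> str | None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        answer = None
        for _ in range(max_steps):
            shot = page.screenshot()                  # 1. capture the current window
            action = decide_next_action(shot, task)   # 2. decide from pixels alone
            if action["type"] == "click":             # 3. execute, then repeat
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "scroll":
                page.mouse.wheel(0, action.get("dy", 600))
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "stop":
                answer = action.get("answer")
                break
        browser.close()
        return answer
```

The point of the sketch is that no DOM inspection appears anywhere: every decision is made from the screenshot alone.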

Performance Leap: Small Model Outperforms Giants

Despite having parameter counts of only 4B and 8B, MolmoWeb demonstrates "small but mighty" performance:

  • Topping the Charts: In the WebVoyager test, the 8B version scored an impressive 78.2%, not only ranking among the top open-source models but also approaching the performance of OpenAI's proprietary model o3 (79.3%).

  • Huge Potential: Researchers found that running a task multiple times and selecting the best result pushes the success rate up to 94.7% (a best-of-N selection sketch follows this list).

  • Precise Localization: In UI element localization benchmarks, it even surpassed Anthropic's Claude 3.7.
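The multi-attempt figure is essentially best-of-N selection over independent runs. A minimal sketch, assuming the `run_task` loop from the earlier snippet and a hypothetical `score_trajectory` judge:

```python
# Illustrative best-of-N selection over repeated runs (not AI2's evaluation code).
def score_trajectory(task: str, answer: str) -> float:
    """Hypothetical judge, e.g. an LLM grader or a rule-based check."""
    raise NotImplementedError

def best_of_n(task: str, start_url: str, n: int = 5) -> str | None:
    best_answer, best_score = None, float("-inf")
    for _ in range(n):
        answer = run_task(task, start_url)       # independent attempt
        if answer is None:
            continue
        score = score_trajectory(task, answer)   # rank completed attempts
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```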

Data Support: The Largest Open Dataset to Date

AI2 has not only open-sourced the model weights but also contributed a massive dataset named MolmoWebMix. This dataset contains:

  • 36,000 real browsing tasks completed by human volunteers.

  • Over 2.2 million screenshot-question-answer pairs.

  • Automated synthetic data verified by GPT-4o. Experiments show that the synthetic data is even better than human trajectories at guiding the agent toward the "optimal path" (a data-loading sketch follows this list).
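For readers who want to inspect the data, screenshot-QA collections of this kind are typically published as Hugging Face datasets. The repository id and field names below are placeholder assumptions, not confirmed paths; check AI2's Hugging Face page for the actual dataset card.

```python
# Illustrative only: streaming a screenshot-QA dataset with the `datasets` library.
# "allenai/MolmoWebMix" is a hypothetical repo id, not a confirmed path.
from datasets import load_dataset

ds = load_dataset("allenai/MolmoWebMix", split="train", streaming=True)
for example in ds.take(3):
    # Field names ("question", "answer") are assumptions; the real schema may differ.
    print(example.get("question"), "->", example.get("answer"))
```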

Open-Source Spirit and Future Challenges

Currently, MolmoWeb is fully available under the Apache 2.0 license on Hugging Face and GitHub. Although it still faces challenges in handling complex instructions, login authentication, and legal compliance (such as terms of service), AI2 firmly believes that only through complete transparency and community collaboration can we truly counter the data monopoly of large tech companies.
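For those who want to try the weights, a typical loading pattern follows AI2's earlier Molmo releases via `transformers`; the repository id below is a guess for illustration, not the confirmed MolmoWeb path.

```python
# Illustrative weight loading; "allenai/MolmoWeb-8B" is a hypothetical placeholder id.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "allenai/MolmoWeb-8B"  # assumption: consult the official release for the real id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True, device_map="auto")
```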

Related Questions

Q: What is the name of the fully open-source web agent released by the Allen Institute for AI (AI2) that navigates using only screenshots?

A: The web agent is called MolmoWeb.

Q: How does MolmoWeb's approach to web navigation differ from traditional web agents?

A: Unlike traditional agents that rely on a webpage's underlying code (the DOM), MolmoWeb makes decisions by reading and analyzing screenshots, making it a "vision-driven" technology.

Q: What was the performance score of the 8B parameter version of MolmoWeb on the WebVoyager test, and how does it compare to OpenAI's model?

A: The 8B version scored 78.2% on the WebVoyager test, which is very close to the performance of OpenAI's proprietary model o3 at 79.3%.

Q: What is the name of the large, open dataset released alongside MolmoWeb, and what does it contain?

A: The dataset is called MolmoWebMix. It contains 36,000 real browsing tasks completed by human volunteers, over 2.2 million screenshot-QA pairs, and automated synthetic data verified by GPT-4o.

Q: On which platforms has MolmoWeb been made available, and under what license?

A: MolmoWeb has been fully released on Hugging Face and GitHub under the Apache 2.0 license.
