
You’re reading an issue of "The AI Economy," my newsletter exploring the forces shaping the AI era—tracking how AI is rewriting business, work, technology, and culture. Subscribe to get expert insights and curated updates delivered straight to your inbox.
The race to build AI agents that can actually use the web is no longer confined to Big Tech or Big AI. On Tuesday, the Allen Institute for AI stepped into the arena with MolmoWeb, an open visual web agent designed to navigate and control a browser much like a human would. Paired with MolmoWebMix—a large training dataset for web-based tasks—the release signals something bigger than another entrant in a crowded field dominated by OpenAI, Anthropic, Google, and Microsoft. It’s an attempt to open up the underlying stack of web agents, not just compete on the surface.
While MolmoWeb may be Ai2’s first web agent, it’s not the first bot from the AI nonprofit founded by the late Paul Allen. Its lineup also includes DR Tolu (long-form, deep-research tasks), Sera (coding), and Asta (scientific research).
MolmoWeb: Look at the Screen, Decide, and Do
MolmoWeb runs on Molmo 2, Ai2’s most recent open multimodal model. Think of it as an extension of that model’s capabilities, bringing captioning, visual reasoning, and image-grounded language understanding to browser control. Developers can choose between two variants, 4B and 8B, offering a trade-off between performance and efficiency.
“As we have seen with LLMs, larger models tend to perform better,” Tanmay Gupta, Ai2’s senior research scientist, tells The AI Economy in an email. “MolmoWeb-8B is a more performant model compared to the 4B model. However, in resource-constrained settings, users may find the 4B model equally effective for simpler tasks yet more efficient. Both models were trained on identical training data.”
When used, MolmoWeb performs a simple loop: it looks at the screen, decides what to do, and does it. It receives task instructions at each step, then produces a short natural-language thought outlining its reasoning before executing a browser action. MolmoWeb can navigate URLs, click at screen coordinates, type into text fields, scroll through pages, open or switch browser tabs, and send messages back to the user.
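The observe-decide-act loop described above can be sketched in a few lines of Python. Note that the helper functions below (`capture_screenshot`, `molmoweb_step`, `execute_action`) are illustrative stubs, not Ai2’s actual API; this is only a minimal sketch of the pattern.

```python
# Minimal sketch of a visual web agent's loop: look at the screen,
# decide on an action, execute it. All helpers are hypothetical stubs.

def capture_screenshot() -> bytes:
    """Stub: grab the current browser viewport as an image."""
    return b"<png bytes>"

def molmoweb_step(task: str, screenshot: bytes) -> tuple[str, dict]:
    """Stub: the model emits a short natural-language thought plus one action."""
    return ("The task is trivially done.", {"type": "send_message", "text": "done"})

def execute_action(action: dict) -> None:
    """Stub: dispatch a click/type/scroll/tab action to the browser."""
    pass

def run_agent(task: str, max_steps: int = 20) -> list[str]:
    """Loop until the agent sends a message back to the user."""
    transcript = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()                  # look
        thought, action = molmoweb_step(task, screenshot)  # decide
        transcript.append(thought)
        if action["type"] == "send_message":               # reply and stop
            transcript.append(action["text"])
            break
        execute_action(action)                             # do
    return transcript
```

The `max_steps` cap is a common safeguard in agent loops so a confused model cannot click around indefinitely.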

The set of available actions capitalizes on what Molmo 2 can do: they operate in the browser viewport, with click locations represented “as normalized coordinates and converted to pixels when executed.” Ai2 claims this lets MolmoWeb interact with websites much as humans do. It isn’t beholden to HTML, accessibility trees, or other structured page representations; it works from the visual layout alone.
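The normalized-coordinate scheme the post mentions is straightforward to illustrate: the model emits an (x, y) pair in the 0-to-1 range relative to the viewport, which is scaled to pixels at execution time. The rounding and clamping below are assumptions for illustration, not Ai2’s documented behavior.

```python
# Convert a normalized click location (0-1 over the viewport) to integer
# pixel coordinates. Clamping keeps edge clicks inside the viewport.

def to_pixels(x_norm: float, y_norm: float,
              viewport_w: int, viewport_h: int) -> tuple[int, int]:
    x_px = min(viewport_w - 1, max(0, round(x_norm * viewport_w)))
    y_px = min(viewport_h - 1, max(0, round(y_norm * viewport_h)))
    return x_px, y_px
```

Because the coordinates are resolution-independent, the same model output works across different screen sizes, which is part of why screenshot-based agents generalize across layouts.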
“Working from screenshots brings practical advantages,” Ai2 writes in a blog post. “A single screenshot is far more compact than a serialized page representation, which can consume tens of thousands of tokens. Visual interfaces also remain stable even when underlying page structures change, and because the model reasons about the same interface the user sees, its behavior is easier to interpret and debug.”
As a result, MolmoWeb can carry out a wide range of everyday web tasks without relying on dedicated APIs.
While Ai2’s new agent comes pre-equipped with capabilities, developers can build their own. Gupta thinks some common use cases could include automating those everyday browser workflows, “especially those that need to be run at a predictable frequency,” such as getting information from a website at a fixed time every week, or “those that need to scale to many queries (e.g., getting author h-index for 100s of authors by iterating over the author list and querying MolmoWeb for each author at a time).”
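The “scale to many queries” pattern Gupta describes amounts to iterating a list and issuing one agent task per item. In the sketch below, `query_agent` is a hypothetical stand-in for a call to a self-hosted MolmoWeb instance, not a real API.

```python
# Fan a list of items out into one agent task each, as in Gupta's
# h-index example. query_agent is a placeholder for the real agent call.

def query_agent(task: str) -> str:
    """Stub: send one task to the agent and return its final message."""
    return f"answer for: {task}"

def batch_lookup(authors: list[str]) -> dict[str, str]:
    """Run one agent task per author, e.g. fetching each h-index."""
    results = {}
    for author in authors:
        results[author] = query_agent(f"Find the h-index of {author}")
    return results
```

In practice a production version would add retries and rate limiting, since each query drives a full browser session.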
He adds that Ai2 did not train MolmoWeb to handle sensitive tasks involving personally identifiable information, logins, passwords, or financial transactions, a safety precaution that suggests the team wants to head off an OpenClaw-style security incident. To that end, Gupta recommends that if developers opt to self-host MolmoWeb, they should “refrain from prompting the model with tasks that require sensitive personal information, such as usernames, passwords, and credit card details.”
How To Train Your Own Web Agent
“One major challenge in building web agents is the lack of public training data,” Ai2 states. This is where MolmoWebMix fits in: it’s a large, open-source dataset that blends human web interactions, simulated task sequences, and visual interface data, all designed for training multimodal web agents.
The dataset includes 30,000 human-completed web tasks—the largest publicly released collection to date—and spans over 590,000 individual actions across 1,100 websites.
Ai2 then generated an additional set of tasks using automated agents that operated on webpage accessibility trees. It used a mix of single-agent tasks, multi-agent teamwork, and systematic site exploration to create a large, diverse dataset of web activity—all without any human involvement.

Lastly, MolmoWebMix features training data that teaches the model how to interpret webpage screenshots. It includes element-grounding tasks, such as identifying where a UI element may appear on a screen, along with screenshot question-answering tasks that require reading and reasoning about page content. Ai2 discloses that the screenshot QA portion has over 2.2 million question-answer pairs from 400 websites.
Evaluating the Web Agent
MolmoWeb was evaluated against the WebVoyager, Online-Mind2Web, DeepShop, and WebTailBench benchmarks. Overall, Ai2 finds that both the 4B and 8B models “achieve state-of-the-art results among open-weight web agents.” More specifically:
- MolmoWeb-8B outperformed leading open-weight models like Fara-7B across all benchmarks.
- MolmoWeb-4B matched Fara-7B on the DeepShop benchmark when given the same step budget.
- The MolmoWeb-8B grounding model outperformed Fara-7B, Claude 3.7, and OpenAI’s CUA on the ScreenSpot and ScreenSpot-v2 benchmarks.
- MolmoWeb-4B also scores competitively as a general web agent while handling full task completion.
When measured against agents built on larger proprietary models, such as OpenAI’s now-retired GPT-4o, that rely on annotated screenshots and structured page data, Ai2 reports that MolmoWeb shines as well. The lab calls it a “striking result given that those models enjoy substantially richer input representations and orders-of-magnitude higher parameters.”

But despite the successful evaluations, Ai2 concedes that its newest web agent has several limitations. First, it’s prone to hallucinating when reading text on screenshots. The bot can also be misdirected by incorrect actions (e.g., scrolling before a page has finished loading and missing relevant content). Ambiguous information or too many constraints can degrade performance. And lastly, the lab reiterates that MolmoWeb has not been trained to handle tasks involving PII, logins, or financial transactions.
The Growing Molmo Timeline and Changes at Ai2
MolmoWeb and MolmoWebMix come weeks after the organization unveiled another extension of the Molmo family, Molmobot. Focused on physical AI, that agent demonstrated zero-shot sim-to-real transfer.
But MolmoWeb’s debut comes during a shake-up within Ai2. Earlier this month, its top two officials, Chief Executive Ali Farhadi and Chief Operating Officer Sophie Lebrecht, resigned. And on the eve of today’s announcement, it was revealed they’ve been tapped by Microsoft to work under Mustafa Suleyman. Moreover, Hanna Hajishirzi and Ranjay Krishna, two key Ai2 researchers who worked on the Olmo and Molmo models, respectively, are joining them.
Ai2 has since named Peter Clark, a founding member, as acting CEO.
In any event, MolmoWeb, along with its training data, evaluation tools, and an inference library for local model execution, can all be downloaded from Hugging Face and GitHub today.
“MolmoWeb represents a step in an exciting scientific direction—pushing multimodal models beyond passive understanding of images towards systems that can act on what they see,” Ai2 concludes in its blog post.
Featured image credit: Ai2

