Ai2's MolmoAct 2: The Open Robot Model for the Real World

Credit: Ai2

You’re reading an issue of “The AI Economy,” my newsletter exploring the forces shaping the AI era—tracking how AI is rewriting business, work, technology, and culture. Subscribe to get expert insights and curated updates delivered straight to your inbox.

In the race to build robots that can work reliably in the physical world, proprietary models have held a clear advantage. However, that edge is narrowing. The Allen Institute for AI (Ai2) on Thursday released its next-generation MolmoAct model, which outperforms a popular rival, Physical Intelligence’s (PI) π0.5, across simulations, zero-shot real-world tasks, and third-party evaluations. Alongside MolmoAct 2’s launch, Ai2 is also releasing a massive open-source bimanual tabletop manipulation dataset called MolmoAct 2-Bimanual YAM, with over 700 hours of training demonstrations.

Introduced in August, MolmoAct is designed to help robots navigate the world with greater spatial awareness before they act. It’s built atop the Molmo LLM, features seven billion parameters, and was trained on 12,000 “robot episodes” from real-world environments. But unlike competitors focused on warehouses and manufacturing, Ai2 targeted the home, training on tasks across kitchens, bedrooms, bathrooms, and living rooms.

“The reception has been excellent,” Ai2 robotics researcher Jiafei Duan tells The AI Economy in an email. “MolmoAct’s full openness gives researchers a rare opportunity to study what it takes to build generalist robotics foundation models, and the academic community has largely focused on using it to investigate the interpretability of such systems.”

He explains that this next-generation MolmoAct is built to support real-world deployment by industry. “It can take on tasks that demand higher precision, speed, and success rates,” such as automating wet labs, bussing tables in cafes, folding laundry at laundromats, and “other meaningful but mundane work.” Notably, Ai2 has already started piloting MolmoAct 2 with the Stanford School of Medicine and other research partners.

What Makes MolmoAct 2 Different?

But don’t be fooled: MolmoAct 2 isn’t just a retrained version of its predecessor. Ai2 rebuilt the architecture from the ground up, which results in the model running 37 times faster than the original MolmoAct.

It all starts with a new base model: Molmo 2-ER. This is a specialized, embodied-reasoning variant trained on three million additional examples, designed to sharpen the model’s perception and reasoning about the physical world.

Ai2 reports that Molmo 2-ER averages 63.8 out of 100 across 13 embodied-reasoning benchmarks, outperforming GPT-5, Gemini 2.5 Pro, Qwen3-VL-8B, and GR-ER 1.5 (Gemini Robotics-ER 1.5).

One benefit of MolmoAct 2 is its improved inference speed. That comes from pairing Molmo 2-ER with a dedicated action expert that generates robot actions via flow matching. The action expert is connected to the vision-language model through a key-value cache bridge, which lets it reuse information the vision-language model has already computed rather than recalculating it from scratch.
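
To make that pairing concrete, here is a minimal toy sketch of flow-matching action sampling with a context that is computed once and then reused at every integration step. It is not Ai2’s implementation: `encode_context`, `velocity`, `sample_action`, and the straight-line velocity field are all hypothetical stand-ins, and a real action expert would use a learned network that attends to the vision-language model’s cached keys and values.

```python
import random

CALLS = {"encode": 0}  # counts how often the expensive context pass runs


def encode_context(observation):
    """Hypothetical stand-in for the vision-language pass (assumed expensive)."""
    CALLS["encode"] += 1
    # Assumption for illustration: the context directly encodes a target action.
    return {"target": [0.5 * value for value in observation]}


def velocity(x, t, cache):
    """Toy velocity field for a straight-line path from noise to the target."""
    return [(target - xi) / (1.0 - t) for xi, target in zip(x, cache["target"])]


def sample_action(observation, steps=10, seed=0):
    """Integrate the velocity field from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    cache = encode_context(observation)  # computed once, reused by every step
    x = [rng.gauss(0.0, 1.0) for _ in observation]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t, cache)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


action = sample_action([0.2, -0.4, 1.0])
```

In this toy setup, the sampler takes ten refinement steps but runs the context encoder only once, which is the general idea behind reusing cached computation instead of recalculating it.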

MolmoAct 2 also ships with MolmoAct 2-FAST, Ai2’s open-source action tokenizer that converts continuous robot movements into discrete tokens the model can process and learn from. While the robotics community has largely relied on Physical Intelligence’s FAST tokenizer for this up to now, Ai2 says PI hasn’t released the data used to train it. MolmoAct 2-FAST is the nonprofit’s answer to that gap, training data included.
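
For a rough sense of what an action tokenizer does, the sketch below turns a short chunk of continuous joint positions into integers and back. It assumes a frequency-domain scheme (a discrete cosine transform followed by rounding), which is how the FAST approach has been publicly described; the function names, the `scale` value, and the sample trajectory are hypothetical, the final compression step is omitted, and none of this is taken from MolmoAct 2-FAST itself.

```python
import math


def dct(values):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(values)
    coeffs = []
    for k in range(n):
        norm = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(
            v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values)
        )
        coeffs.append(norm * total)
    return coeffs


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    values = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            norm = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += norm * c * math.cos(math.pi * (i + 0.5) * k / n)
        values.append(total)
    return values


def tokenize(actions, scale=10.0):
    """Continuous actions -> integer tokens (rounded, scaled DCT coefficients)."""
    return [round(scale * c) for c in dct(actions)]


def detokenize(tokens, scale=10.0):
    """Integer tokens -> approximate continuous actions."""
    return idct([t / scale for t in tokens])


trajectory = [0.00, 0.05, 0.12, 0.20, 0.27, 0.32, 0.35, 0.36]  # made-up joint positions
tokens = tokenize(trajectory)
restored = detokenize(tokens)
```

The round trip is lossy by design: a larger `scale` keeps more detail at the cost of a wider range of token values.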

Because of these improvements, Ai2 claims that MolmoAct 2’s inference is “dramatically faster”: a single action call takes 450 milliseconds in the base model and 1,300 milliseconds in MolmoAct 2 with adaptive depth reasoning. By comparison, the original MolmoAct takes 6,700 milliseconds, roughly 14 times longer than the base model, when running in the LIBERO benchmark environment on a single NVIDIA H100.
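
The ratios behind those latency figures are easy to check. The short calculation below uses only the three numbers reported for the LIBERO setup; the variable and function names are my own, and the result is not meant to reproduce any other speedup figure.

```python
ORIGINAL_MS = 6700        # original MolmoAct, per action call
BASE_MS = 450             # MolmoAct 2 base model
ADAPTIVE_DEPTH_MS = 1300  # MolmoAct 2 with adaptive depth reasoning


def speedup(old_ms, new_ms):
    """How many times faster the new latency is than the old one."""
    return old_ms / new_ms


base_speedup = speedup(ORIGINAL_MS, BASE_MS)
depth_speedup = speedup(ORIGINAL_MS, ADAPTIVE_DEPTH_MS)
print(f"{base_speedup:.1f}x, {depth_speedup:.1f}x")  # prints "14.9x, 5.2x"
```

So the base model comes in just under 15 times faster than the original, in line with the roughly 14x cited, while the adaptive-depth variant is about five times faster.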

Another fundamental difference lies in how the two generations of MolmoAct handle bimanual manipulation. This is when a robot uses two arms simultaneously to complete a task, just as humans naturally use both hands. With MolmoAct, this capability was possible through per-task fine-tuning. But with MolmoAct 2, it’s native to the base model, meaning it’s ready to work right out of the box—no additional tuning required.

MolmoAct 2 is also designed for more robot embodiments. It can be deployed quickly for three specific robot embodiments, but Duan says it can “be fine-tuned to new embodiments far more easily than its predecessor.”

Ai2 warns that although MolmoAct 2 is highly capable, it has limitations. It admits that the model can struggle when its gripper blocks the camera’s view, when the model can’t respond as quickly as the robot’s control system, or when a task requires fine-grained manipulation.

Out In the Real World

In a departure from previous releases, Ai2 disclosed that MolmoAct 2 is already being used in the wild. An early tester is the Cong Lab at Stanford’s School of Medicine. Under the leadership of Professor Le Cong, researchers are developing a self-driving wet lab to accelerate genome engineering. Ai2 calls this an ideal stress test for robotics models: the environment is unstructured, the tasks require repeatable precision, and small errors can accumulate over the course of an experiment.

MolmoAct 2 is used to direct the robot’s arm during routine manipulation steps in CRISPR gene-editing experiments, such as moving samples between stations and operating benchtop equipment. According to Ai2, the Stanford team found that MolmoAct 2 “shows strong potential to streamline key parts of wetlab operations and, in turn, accelerate scientific discovery.”

Creating the Largest Open Robotics Dataset

Existing robotics datasets have posed a problem for the field: proprietary ones are closed off, and open-source alternatives have fallen short of expectations, at least in Ai2’s view. Researcher Jason Lee says that while the team was working on MolmoAct, the Open X-Embodiment project wasn’t sufficient for Ai2’s needs, prompting the nonprofit to collect its own data, although that dataset consisted of “raw robot action data.”

The team created what it calls “the largest open-source bimanual robotics dataset ever released.” Working with Cortext AI, Ai2 curated MolmoAct 2-Bimanual YAM, a collection of more than 700 hours of robot demonstrations featuring two robotic arms working together to fold a towel, scan groceries, charge a smartphone, and bus tables. It contains over 30 times as much robot data as was used to train the original MolmoAct.

To be clear: MolmoAct 2 wasn’t trained on MolmoAct 2-Bimanual YAM alone. That dataset is supplemented by other robot datasets that expose the model to different arms, camera setups, control schemes, and task styles. These include large-scale SO-100/SO-101 datasets from low-cost open-source robot arms; filtered DROID Franka data for real-world single-arm manipulation across varied scenes; Google Robot BC-Z and Fractal data from Open X-Embodiment; Bridge WidowX; and, of course, the original training data from MolmoAct.

The MolmoAct 2 Family

MolmoAct 2 comes in two variants: the base model and MolmoAct 2-Think. The latter uses depth-perception tokens—representations that help the model understand how far objects are and where they sit in three-dimensional space—for tasks that benefit from “explicit 3D reasoning.” To save on compute costs, Ai2’s adaptive-depth mechanism routes depth prediction only when it will likely improve task performance. “This enables MolmoAct 2 to reason more deeply about 3D spatial structure while maintaining efficient inference,” the organization writes in a blog post.

In addition, MolmoAct 2-Think focuses depth prediction only on regions with dynamic scene changes rather than every image patch, a selective approach Ai2 says delivers a 17 percent speedup over full depth-token prediction.
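
Ai2 doesn’t spell out here how it decides which regions count as dynamic, so the following is only an illustrative sketch of the general idea: compare two frames patch by patch and flag only the patches that changed. The frame-difference rule, the `patch` size, the `threshold`, and the function name are all assumptions of mine rather than details of MolmoAct 2-Think.

```python
def changed_patches(prev_frame, curr_frame, patch=2, threshold=0.1):
    """Return (row, col) indices of patches whose mean absolute change exceeds threshold."""
    rows, cols = len(curr_frame), len(curr_frame[0])
    selected = []
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            diffs = [
                abs(curr_frame[r][c] - prev_frame[r][c])
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            if sum(diffs) / len(diffs) > threshold:
                selected.append((r0 // patch, c0 // patch))
    return selected


# A 4x4 grayscale frame split into four 2x2 patches; only the top-left pixel changes.
prev = [[0.0] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[0][0] = 1.0
print(changed_patches(prev, curr))  # prints [(0, 0)]
```

In this example, only one of the four patches would be sent on for depth prediction, which is where a selective approach saves work.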

MolmoAct 2 Benchmarking

Here’s how MolmoAct 2 performed when evaluated in simulation, zero-shot deployment, and post-training adaptation to new robot settings.

Simulation: Ai2 says MolmoAct 2 performs “strongly” on its internal manipulation benchmark, MolmoBot, scoring an average 20.6 percent success rate across all tasks, roughly double that of Physical Intelligence’s π0.5. On the bimanual manipulation benchmark, RoboEval, MolmoAct 2 scored 0.443 while PI’s π0.5 notched 0.405.

Real-world zero-shot tests: Using a Franka arm, Ai2 reports that MolmoAct 2 does better than PI’s π0.5 on every task evaluated. For instance, in the apple-to-plate task, MolmoAct 2 succeeded in 100 percent of evaluations. It scored 86.7 percent on the pipette-to-tray task. And on the longest-horizon tasks, such as moving several objects into a bowl in sequence, Ai2 claims MolmoAct 2 had a 62 percent success rate.

Post-training adaptation: MolmoAct 2 performed well during towel folding, bowl placement, table wiping, and tray lifting. Ai2 claims this demonstrates that the model “can be adapted to practical manipulation behaviors via post-training.”

LIBERO: This benchmark measures how well a model can acquire and retain many skills over time. After post-training, MolmoAct 2 scored 97.2 percent and MolmoAct 2-Think scored 98.1 percent, improvements of 10.6 and 11.5 points over MolmoAct, respectively.
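
As a quick consistency check, the LIBERO scores and point gains above should both imply the same baseline for the original MolmoAct. The calculation below uses only those four reported numbers; the baseline it prints is derived by subtraction, not quoted from Ai2.

```python
SCORES = {"MolmoAct 2": 97.2, "MolmoAct 2-Think": 98.1}  # percent, after post-training
GAINS = {"MolmoAct 2": 10.6, "MolmoAct 2-Think": 11.5}   # points over the original MolmoAct

# Subtracting each gain from its score should recover the same original score.
implied_baselines = {name: round(SCORES[name] - GAINS[name], 1) for name in SCORES}
print(implied_baselines)  # prints {'MolmoAct 2': 86.6, 'MolmoAct 2-Think': 86.6}
```

Both variants point back to the same 86.6 percent starting point, so the reported gains are internally consistent.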

“In robotics today, the leading open-weight generalist foundation model used by industry is π0.5 from Physical Intelligence, a company backed by billions of dollars in investment,” Duan says. “We’ve shown that we can achieve comparable performance with significantly less data while being more open, making it easy for others to build their workflows on top of our model. I like to think of it as comparable to the broader shift the field saw when DeepSeek arrived on the scene.”

The Quest for the ‘Robot Brain’ Continues

When we spoke last August, Duan raised the idea of developing the “robot brain,” perhaps the embodied equivalent of artificial general intelligence (AGI). He believes it’s the “next frontier” to explore and predicts this year will be about Embodied AI, thanks to an explosion of data that can be used to imbue machines with human-like intelligence.

The release of MolmoAct 2 is another step in the right direction, Duan says. “At least with MolmoAct 2, we are looking at [a] path towards real-world deployment beyond [the] research setting.”

That progress comes despite turbulence at Ai2. In March, CEO Ali Farhadi, COO Sophie Lebrecht, and key researchers Hanna Hajishirzi and Ranjay Krishna all departed for Microsoft, where they joined Mustafa Suleyman’s team. Krishna’s exit is particularly notable given his role as Ai2’s computer vision research lead and the main spokesperson for Molmo and related work.

Ai2 has since tapped Dieter Fox, senior research director and longtime professor at the University of Washington, to lead the robotics team, according to a spokesperson. Despite the staffing change, the nonprofit remains steadfast in MolmoAct’s objective: to create a general-purpose Action Reasoning Model (ARM) capable of reasoning in space.

MolmoAct 2, along with Ai2’s MolmoAct 2-Bimanual YAM dataset, MolmoAct 2-FAST, and all the training code, is available for download today.

Featured Image: Credit: Ai2
