Ai2’s Tulu 3 405B Pushes Open-Weight AI Forward, Challenging DeepSeek, OpenAI Models

A futuristic visualization of an expanding neural network, symbolizing the scaling of AI models from smaller to massive 405B parameters. Image credit: Adobe Firefly

The nonprofit formerly known as the Allen Institute for Artificial Intelligence has released a much larger version of its Tulu 3 model. In addition to the 8 billion and 70 billion parameter versions introduced in November 2024, there’s now a 405 billion parameter option. Ai2 calls this iteration the “first application of fully open post-training recipes to the largest open-weight models.”

What does that mean? The company explains that typically, when a model is released, it’s either made accessible through an API, as with OpenAI’s ChatGPT or Anthropic’s Claude, or the model weights are provided, while the training recipes are withheld. Ai2’s process runs counter to this: it has shared all the details of Tulu 3 405B’s training, from the exact datasets, code, and commands to the model weights themselves.

“This makes the post-training of it fully open, in contrast to releases such as DeepSeek, which do share weights and findings, but do not release their data mixtures or hyperparameter settings. This allows researchers and practitioners to better study and train language models in the future,” Ai2 informs me.

Evaluations provided by the organization suggest Tulu 3 405B performs competitively with, or better than, DeepSeek V3 and OpenAI’s GPT-4o, and outperforms other comparable open-weight post-trained models such as Meta’s Llama 3.1 405B Instruct and Nous Hermes 3 405B.

How Ai2’s Tulu 3 405B model performed compared to peer 405B models. Image credit: Ai2

Although Ai2’s model arrives just weeks after DeepSeek’s buzzworthy V3 announcement, I’m told that Ai2 started work on its 405B model in early December and continued fine-tuning until its launch at the end of January.

A Tulu 3 Refresher

Called a “real shift for the [open-source] community” because “now anybody can post-train a model as good as the closed-sourced ones,” Tulu 3 aims to make advanced AI training and evaluation accessible beyond the Big AI providers. Because the model is open-sourced, any company can now post-train its AI just as OpenAI and Anthropic do.

An infographic detailing the pre- and post-training process for language models. Image credit: Ai2

“Any language models that you interact with, like [OpenAI’s] ChatGPT, [Anthropic’s] Claude, and so on, have gone through multiple stages of training,” Hannaneh Hajishirzi, Ai2’s senior director of NLP Research, explained in November. Many of us are familiar with the pre-training phase, in which an AI model is trained on a large set of web data, but a model at that stage is not ready for prime time. “It’s not able to follow human instructions, is not safe nor robust, and doesn’t even have a lot of these high-quality skills that you would expect from it.”

Researchers in the public sector, healthcare, academia, and regulated industries (those who require data traceability) will likely benefit most from Tulu 3. That said, because Tulu 3 405B is such a large model, Ai2 remains uncertain about its use cases, saying the model’s potential won’t be realized until it is hosted through APIs, the way Meta’s Llama 405B Instruct is.

And speaking of Meta, Ai2 discloses that Tulu 3 405B utilizes Meta’s Llama base, which isn’t fully open. This is a departure from the other Tulu 3 variations.

Nevertheless, the company asserts that this release underscores that “open models are going to continue to improve at a rapid rate and can be competitive with other closed models. While maybe not the model that most people load onto their local infrastructure, it is a sign for what is to come.”

Now that Ai2 has three Tulu 3 variations, the company has created what appears to be a scalable suite of models any company can use for post-training, no matter how small or large its data set.

Making RLVR More Accessible

More importantly, Ai2 highlights that its new release demonstrates the positive impact of its Reinforcement Learning from Verifiable Rewards (RLVR) framework on performance. “Our successes with scaling RLVR reinforce our belief that we were early to this new form of training as popularized with reasoning models such as R1 and O1,” a company spokesperson tells me. “This reinforces our belief that smaller and more open players can participate in the frontier of AI development and we are excited to explore this along with other contributors of the open ecosystem.”

First added to the smaller Tulu 3 models, RLVR is a training method that targets skills like math and instruction following. It applies reinforcement learning to tasks whose outcomes can be automatically verified as correct or incorrect; when the model produces a correct result, that behavior is rewarded, and the model is fine-tuned accordingly. The goal is to push Tulu 3’s capabilities beyond what earlier fine-tuning efforts achieved.
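To make the mechanics concrete, here’s a minimal sketch of the verifiable-reward idea in Python. It is an illustration of the concept, not Ai2’s actual training code: the function names are hypothetical, and in the real recipe the binary reward feeds a reinforcement-learning policy update rather than a print statement.

```python
# Minimal sketch of RLVR's core idea: reward a completion only when a
# deterministic verifier confirms it is correct. Hypothetical illustration,
# not Ai2's training code.

def verify(completion: str, expected: str) -> bool:
    """Deterministic check: the completion is correct iff it contains
    the expected answer."""
    return expected.strip() in completion

def rlvr_rewards(completions: list[str], expected: list[str]) -> list[float]:
    """Binary rewards: 1.0 for verified-correct completions, 0.0 otherwise.
    In RLVR, these rewards drive the reinforcement-learning update."""
    return [1.0 if verify(c, e) else 0.0 for c, e in zip(completions, expected)]

# Example: a math prompt whose answer ("4") can be checked automatically.
print(rlvr_rewards(["2 + 2 = 4", "2 + 2 = 5"], ["4", "4"]))  # -> [1.0, 0.0]
```

Because the reward comes from a checkable ground truth rather than a learned reward model, there’s no separate preference model to train, which is part of what makes the approach comparatively accessible.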

Ai2 says the team has learned that RLVR can be applied to Tulu 3 405B without substantive tweaking. “The 405B model highlights how our reinforcement learning with verifiable rewards training can be even more impactful with a powerful base model.”

Tulu 3 405B and its model siblings can be found on Hugging Face. In addition, you can try out a demo on the Ai2 playground.
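For those who want to kick the tires locally, below is a hedged sketch of loading a Tulu 3 checkpoint with Hugging Face’s transformers library. The repo ID is an assumption based on Ai2’s published naming for the smaller checkpoints (verify it on the hub before use), and the 8B variant is shown here because the 405B weights demand datacenter-scale hardware.

```python
# Hypothetical sketch: load a Tulu 3 checkpoint from Hugging Face and generate.
# Requires: pip install transformers accelerate (device_map="auto" needs accelerate).

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-8B"  # assumed repo ID; check the hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain reinforcement learning with verifiable rewards in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```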
