Ai2 has introduced an addition to its Tulu suite of models to level the playing field between open-source and proprietary closed models in post-training performance. Coming nearly a year after its predecessor, Tulu 3 aims to help models retain their core skills when undergoing specialized training: following instructions, coding, doing math, recalling knowledge, reasoning, respecting safety guardrails, chatting, and handling multiple languages.
It’s like completing your K-12 education and then choosing a major in college. While completing your post-secondary education, you might forget certain subjects or concepts. It’s an analogy that Sophie Lebrecht, Ai2’s Chief Operating Officer, felt was apt.
Open-Source’s Post-Training Problem
OpenAI, Anthropic, and other Big Tech companies have more complex processes, vast data, and better evaluation setups to compensate for this. But “none of this is open to the public,” says Hannaneh Hajishirzi, Ai2’s senior director of NLP Research. Tulu 3 makes that know-how accessible to anyone who wants to fine-tune their models.
“Any language models that you interact with, like [OpenAI’s] ChatGPT, [Anthropic’s] Claude, and so on, have gone through multiple stages of training,” she explains. Many of us are familiar with the pre-training phase in which an AI model is trained on a large set of web data, but it’s not ready for prime time. “It’s not able to follow human instructions, is not safe nor robust, and doesn’t even have a lot of these high-quality skills that you would expect from it.”
This is why Ai2 calls post-training critical and challenging.
“The art of post-training is to elicit and enhance the important capabilities that the model is somewhat acquiring through pre-training, but also enhances them, giving them more complex math reasoning abilities, or making them better at coding, and so on. Importantly, it enables these models to follow human instructions and then do the things that we are asking them to do,” Hajishirzi asserts.
However, “when you teach a model, for example, to become a really good coder, it forgets some of its previous capabilities. Now, it can’t write a poem [and isn’t] able to follow human instructions. We even had one example where we started training a model to answer very complex scientific questions. Then, when we asked, ‘Who is Barack Obama?’, it was hallucinating and making mistakes.”
Though challenging, it’s something some large AI firms have figured out. Hence the reason for Tulu 3’s release: the open-source community “lags behind,” Hajishirzi claims. “We are trying to address these limitations in the open-source community, where we started looking at the core skills…Then we started building a really good evaluation framework to guide our platform and set up goals for ourselves: What do we want to achieve? Arguably, open source did not have something like this, and it is important to guide our process. So, with our evaluation framework, we are releasing a list of benchmarks, the toolkit that facilitates the whole implementation, and the experimentation pipeline.”
Tulu 3: What You Need to Know
Lebrecht describes the new framework as “a real shift for the [open-source] community” because “now anybody can post-train a model as good as the closed-sourced ones.”
Tulu 1 + Tulu 2 = Tulu 3
Tulu 3 is available in two model sizes: 8 billion parameters and 70 billion parameters. It’s an evolution of Tulu 1 and Tulu 2. “We did a lot of investigations on data, training, and evaluation,” Hajishirzi declares.
First, Ai2 collected different types of data targeted at a model’s core capabilities, ensuring it was diverse, high-quality, and open source. Then, after decontaminating it against the evaluation benchmarks, the data was mixed and matched before undergoing supervised fine-tuning. This is what made up the first-generation Tulu.
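To make that data step concrete, below is a minimal, illustrative sketch of how per-skill prompts could be pooled and screened against a held-out evaluation set before supervised fine-tuning. The n-gram overlap check, the skill names, and the prompts are assumptions for illustration, not Ai2’s actual pipeline.

```python
# Illustrative sketch: mix per-skill SFT data and decontaminate it against an
# evaluation set using a simple word n-gram overlap check. All names, prompts,
# and the n-gram size are hypothetical, not Ai2's actual values.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a prompt."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_prompts: list, eval_prompts: list) -> list:
    """Drop training prompts that share any n-gram with an evaluation prompt."""
    eval_grams = set()
    for p in eval_prompts:
        eval_grams |= ngrams(p)
    return [p for p in train_prompts if not (ngrams(p) & eval_grams)]

# Per-skill data sources mixed into one supervised fine-tuning pool.
skill_data = {
    "math": ["Solve 12 * 7 and explain each step."],
    "coding": ["Write a Python function that reverses a string."],
    "chat": ["Write a short poem about the ocean."],
}
eval_set = ["Solve 12 * 7 and explain each step."]  # prompts held out for evaluation

sft_pool = []
for skill, prompts in skill_data.items():
    sft_pool.extend(decontaminate(prompts, eval_set))

print(len(sft_pool))  # 2: the math prompt overlaps the eval set and is dropped
```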
Next, there’s preference tuning, sometimes called Reinforcement Learning From Human Feedback (RLHF). Hajishirzi describes it as “where you look at two kinds of completions given one prompt…and you want to see which one is better.” It’s how the model is taught to answer things better. Tulu uses a variation called Direct Preference Optimization (DPO). It also includes new infrastructure and Proximal Policy Optimization (PPO) training, a popular reinforcement learning algorithm. All of this is baked into Ai2’s second-generation framework.
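For readers who want to see what that preference signal looks like in code, here is a minimal sketch of the DPO objective in plain PyTorch. It is not Ai2’s training code; the beta value and the log-probabilities in the toy batch are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    Each argument is the summed log-probability a model assigns to a completion:
    'chosen' is the preferred answer to a prompt, 'rejected' the other one.
    The policy is nudged to prefer the chosen completion over the rejected one
    more strongly than a frozen reference model does, scaled by beta.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of two preference pairs with made-up log-probabilities.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -9.0]),
    ref_chosen_logps=torch.tensor([-12.5, -9.4]),
    ref_rejected_logps=torch.tensor([-13.0, -9.2]),
)
print(loss.item())
```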
Today, Tulu 3 incorporates the concepts of supervised fine-tuning and DPO. In addition, Ai2 has added Reinforcement Learning with Verifiable Rewards (RLVR). It’s an algorithm targeting specific skills where the model is “working on the type of tasks that we can kind of execute them and verify if the outcome is true or not. Like, you solve a math problem. Now, you have the final answer. Did you get it correctly or not? If yes, then you boost that.”
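A verifiable reward in that sense can be as simple as checking the final answer. The sketch below assumes the model is prompted to end its output with a line like “Answer: 84”; that convention, and the exact-match check, are our own simplifications rather than Tulu 3’s implementation.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Reward for Reinforcement Learning with Verifiable Rewards (sketch).

    Instead of scoring the completion with a learned reward model, the final
    answer is extracted and compared against the known correct one.
    Returns 1.0 when it matches, 0.0 otherwise.
    """
    match = re.search(r"Answer:\s*(.+)", model_output)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

print(verifiable_reward("12 * 7 = 84\nAnswer: 84", "84"))  # 1.0, boost this completion
print(verifiable_reward("12 * 7 = 82\nAnswer: 82", "84"))  # 0.0, no reward
```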
A Recipe to Follow
For companies that lack the means or the know-how to carry out all the steps of post-training, Tulu 3 is billed as a handy recipe to follow. Not only does it provide data around core skills such as the aforementioned coding, math, multilingual ability, safety, chat, reasoning, and instruction following, but it also features three data toolkits, the training code and infrastructure, and checkpoints.
As Hajishirzi puts it, developers can use their own data, identify the skill(s) they want to add to the system, and see what’s available within Tulu 3. From there, they can mix and match, opting, for example, for a model that’s good at math but not at knowledge recall or instruction following. Developers would follow the recipe and keep evaluating the model to see if it delivers the expected outcome.
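As a rough picture of what following that recipe might look like, here is a hypothetical configuration a developer could assemble from the released data and training code. The field names, stage parameters, and dataset names are illustrative, not Tulu 3’s actual format.

```python
# Hypothetical post-training recipe assembled from Tulu 3's released artifacts.
# Every key and value here is illustrative only.
recipe = {
    "base_model": "meta-llama/Llama-3.1-8B",                 # model being post-trained
    "skills": ["math", "coding", "instruction_following"],   # skills to target
    "stages": [
        {"name": "sft", "epochs": 2},                        # supervised fine-tuning on mixed data
        {"name": "dpo", "beta": 0.1},                        # preference tuning
        {"name": "rlvr", "tasks": ["math_word_problems"]},   # RL on tasks with checkable answers
    ],
    "evaluate_after_each_stage": True,                       # watch for forgotten skills
}

for stage in recipe["stages"]:
    print(f"run {stage['name']} on {recipe['base_model']}")
```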
“If you take Tulu out of the box, if you follow these tips, tricks, data mix and recipes, you can be as good as the closed models,” Lebrecht claims. Using Tulu 3 may also save companies money: since it’s plug and play, she asserts, there won’t be a need to fundraise “a ton of money” for compute. “This is why it’s going to be a huge game changer in terms of people [being] able to create these really high-quality, task-specific models.”
Although Tulu 3 supports eight core skills today, Ai2 plans to add more, such as answering complex scientific questions. Since it’s open source, developers can also build their own, and some are already available on Hugging Face.
Tulu 3 Benchmarks
Even though Ai2 Chief Executive Ali Farhadi calls evaluations “bogus,” the company claims Tulu 3 performs better than some prominent models, including OpenAI’s GPT-3.5 and GPT-4o mini and Anthropic’s Claude 3.5 Haiku, across 12 tasks. It even does better than its two predecessors.
Hajishirzi discloses that Ai2 applied Tulu 3 to Meta’s Llama models at 8 billion and 70 billion parameters and is planning to do the same with other base models. One of those is Ai2’s OLMo, which launched in February. “OLMo Tulu is going to be a fast follow,” Lebrecht says, noting that initial results are “really good,” although more testing still needs to be done.
Currently, Tulu 3 isn’t connected with any tools, although Hajishirzi acknowledges that’s something Ai2 wants to do, including with agents and retrieval augmented generation (RAG) tools.
Who is Tulu 3 For?
Though the framework is available to any developer, Ai2 believes it will benefit those in research, the public sector, healthcare, academia, and regulated industries most. To Lebrecht, Tulu 3 can help companies that need data traceability, meaning that if their information ever comes under investigation, they stand a chance of identifying where it’s going.
She also notes Tulu 3 will help us realize AI’s full potential: “People are using the closed models in a sandbox. This is the challenge. We’re not seeing the full promise of the economic impact we thought with AI because people are dabbling. If you’re in a regulated industry, you must have human verification because you can’t fully unleash generative AI in many cases. So we see adoption, but it’s often restricted to pilots or sandboxes, whereas this has the full potential that we’ll be able to be fully deployed at scale because it will be certified for safe use.”
A New Post-Training Era
Nathan Lambert, a machine learning researcher at Ai2, notes that the evolution of post-training has undergone cycles of innovation and stagnation. Unfortunately, despite early progress with Alpaca, Vicuna, Koala, and Dolly, the open-source community has struggled to keep pace with its closed counterparts. Skepticism and doubt around RLHF soon followed, even though OpenAI highlighted it as integral to ChatGPT’s success.
RLHF lost the favor of the open-source community due to its high data budgets, which ranged from $100,000 to $1 million. Lambert explains that those who did embrace it early on found success.
The introduction of DPO in 2023 would revitalize interest, leading to Tulu 2 and Zephyr-Beta. Preference tuning defined this era, becoming an essential component of competitive models. Open-source development would eventually plateau, though, thanks to constrained resources and datasets.
Companies building closed models would continue to innovate on post-training, turning it into complex, multi-stage processes with instruction tuning, RLHF, and more. With Tulu 3, Ai2 hopes to reignite interest in post-training among its open-source brethren to spur more innovation.
“Post-training has been heavily guarded, kept under lock and key because that makes the models usable and sellable,” Lebrecht says. “And so…we’ve got these generalized capabilities at pre-training, but how do we take them and map them to use cases? Because that’s going to be the most value. And so how they do that is in the post-training stage, and that’s what’s being kept completely under lock and key. And this is where Tulu will, in some ways, just completely rock that. Because now anybody has what they need to perform at that level.”
Ai2 has set up a playground for those interested in a demo. You can also find Tulu 3 on Hugging Face.
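For developers who just want to try the released weights, a minimal sketch using Hugging Face’s transformers library is below. The repository id is our assumption; check Ai2’s Hugging Face page for the exact model names.

```python
# Minimal sketch of chatting with a Tulu 3 model via transformers.
# The repo id below is assumed; confirm the exact name on Ai2's Hugging Face page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-8B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Who is Barack Obama?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```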
Featured Image: An infographic detailing the Tulu 3 model recipe and artifacts. Image credit: Ai2