Ai2’s RewardBench 2 Is a Tougher Benchmark for Testing How Well AI Models Reflect Human Judgment

An AI-generated image of a fictitious dashboard displaying model performance metrics. Image credit: Microsoft Copilot

Ai2 has released an update to its RewardBench benchmark, making it a more rigorous test of reward models. The next-generation benchmark is built from new, more challenging examples that assess how accurately these models can identify the responses a human would judge best. In its first round of testing, RewardBench 2 ranks Google’s Gemini 2.5 Flash among the leading models.

What are reward models? These are systems used to train AI to make better decisions by scoring or ranking responses based on what humans prefer. In other words, when an AI generates multiple potential answers to a question, the reward model judges them and decides which one is best. Reward modeling is often used in Reinforcement Learning from Human Feedback (RLHF), such as with Ai2’s Tulu 3 and OLMo 2.
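To make the scoring step concrete, here is a minimal, illustrative sketch. The `reward_score` function is a toy stand-in for a trained reward model, and both the function name and its heuristic are assumptions for illustration, not Ai2’s implementation:

```python
# Illustrative sketch: a reward model assigns a scalar score to each candidate
# response, and the highest-scoring response is kept.

def reward_score(prompt: str, response: str) -> float:
    # Toy stand-in: a real reward model returns a learned scalar reflecting
    # human preference, not a length-based heuristic like this one.
    return float(len(response.split()))

def pick_best(prompt: str, candidates: list[str]) -> str:
    # Score every candidate and return the one the "reward model" prefers.
    scores = [reward_score(prompt, c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]

if __name__ == "__main__":
    prompt = "Explain what a reward model does."
    candidates = [
        "It scores responses.",
        "It scores candidate responses so training can favor the ones humans prefer.",
    ]
    print(pick_best(prompt, candidates))
```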

RewardBench was first introduced in 2024 to give developers a better understanding of how well reward models align AI systems with human preferences. While the first-generation benchmark looked at 30 models, the new version evaluates 70 reward models. Ai2 says this broader sweep was done to better understand the correlation between “reward model evaluation and downstream performance” and to build on preference data from the organization’s Tulu 3 and from Skywork AI’s open reward models.

To ensure tougher testing, Ai2 says it curated original human prompts that haven’t appeared in any AI training data, in contrast to other benchmarks, which typically reuse prompts from downstream evaluations. These unseen prompts are designed for a “best-of-4” evaluation format, in which a reward model must identify the best response among four candidates. In addition, RewardBench 2 challenges models across six domains: math, safety, instruction following, focus, factuality, and ties. The last is new to RewardBench and tests how reward models handle questions with multiple correct answers.
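Based on Ai2’s description, a best-of-4 evaluation reduces to an accuracy check: did the reward model score the preferred response above the other three candidates? The sketch below assumes a dataset where each item holds a prompt, four candidate responses, and the index of the preferred one; the field names and the `score_fn` callable are hypothetical placeholders, not RewardBench 2’s actual code.

```python
# Sketch of a best-of-4 accuracy computation over a list of evaluation items.
from typing import Callable

def best_of_4_accuracy(
    items: list[dict],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of prompts where the reward model ranks the preferred
    response above the other three candidates."""
    hits = 0
    for item in items:
        scores = [score_fn(item["prompt"], r) for r in item["responses"]]
        if scores.index(max(scores)) == item["correct_index"]:
            hits += 1
    return hits / len(items)

# Toy usage with a dummy scorer that happens to favor the right answer.
if __name__ == "__main__":
    data = [{
        "prompt": "What is 2 + 2?",
        "responses": ["5", "4", "22", "3"],
        "correct_index": 1,
    }]
    print(best_of_4_accuracy(data, lambda p, r: float(r == "4")))
```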


Here’s what Ai2 found:

  • Leading models score 20 or more points lower on RewardBench 2 than on the original RewardBench
  • Closely related base models can produce significant differences in reward model performance and, contrary to popular belief, training for more than one epoch can improve results
  • RewardBench 2 scores align well with real-world performance in tasks like best-of-N sampling
  • When it comes to RLHF, compatibility between models matters. In other words, using a reward model from a different model family than the policy model could negatively impact performance, even if it ranks highly on the benchmark

Joining Gemini 2.5 Flash at the top of the initial leaderboard are Google’s Gemma 2 27B, Meta’s Llama 3.1 70B, Anthropic’s Claude Opus 4, Meta’s Llama 3.1 70B Instruct, and Skywork Reward Gemma 2 27B.

By using RewardBench 2, developers can determine whether the reward models steering their systems truly favor thoughtful, accurate answers rather than memorized patterns. The accompanying leaderboard adds transparency, offering deeper insights into which models align more closely with human judgment, an increasingly important factor when choosing between competing LLMs.

Ai2 Chief Executive Ali Farhadi once told me he thinks we’re in an “evaluation crisis.” The tools his non-profit organization has released over the past year, like Tulu 3 and OLMoTrace, make clear how his team aims to shed light on how models perform. RewardBench 2 is another tool in the developer toolkit to help builders better understand the models powering their apps.

