Roberta Raileanu
April 26 @ 11:30 am - 1:00 pm
Title: Teaching Large Language Models to Reason with Reinforcement Learning
Abstract: In this talk, I will discuss how we can use Reinforcement Learning (RL) to improve reasoning in Large Language Models (LLMs), as well as when, where, and how to refine LLM reasoning. First, we study how different RL-like algorithms can improve LLM reasoning, investigating both sparse and dense rewards provided to the LLM either heuristically or via a learned reward model. However, even with RL fine-tuning, LLM reasoning remains imperfect. Prior work found that LLMs can further improve their reasoning via online refinements, but in our new work we show that LLMs struggle to identify when and where to refine their reasoning without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict the correctness of the final answer, can indicate when to refine. Process-based Reward Models (PRMs), trained to predict the correctness of intermediate steps, can indicate where to refine, but they are expensive to train, requiring extensive human annotations. We introduce Stepwise ORMs (SORMs), which are trained only on synthetic data to approximate the expected future reward of the optimal policy, V*. Our experiments show that SORMs detect incorrect reasoning steps more accurately than ORMs, improving downstream accuracy on reasoning tasks. As for how to refine LLM reasoning, we find that global and local refinements have complementary benefits, so combining them achieves the best results. With this strategy, we improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% with greedy sampling.
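As a rough illustration of the idea behind SORMs described above, the sketch below labels each intermediate reasoning step by whether continuations sampled from the policy after that step still reach the correct final answer, a Monte Carlo stand-in for the step's value under the policy. The helpers `sample_continuations` and `is_correct`, and the specific labeling rule, are hypothetical placeholders for this sketch, not the talk's actual implementation.

```python
from typing import Callable, List

def label_steps_for_sorm(
    question: str,
    steps: List[str],
    reference_answer: str,
    sample_continuations: Callable[[str, int], List[str]],  # hypothetical: samples policy rollouts from a prefix
    is_correct: Callable[[str, str], bool],                 # hypothetical: checks a rollout against the reference answer
    num_rollouts: int = 8,
) -> List[int]:
    """Assign a synthetic 0/1 label to each reasoning step.

    A step gets label 1 if at least one of `num_rollouts` continuations
    sampled after that step reaches the correct answer, i.e. a crude
    Monte Carlo estimate of whether the prefix can still lead to a
    correct solution (a proxy for the step's value, in the spirit of V*).
    """
    labels = []
    prefix = question
    for step in steps:
        prefix = prefix + "\n" + step
        rollouts = sample_continuations(prefix, num_rollouts)
        reachable = any(is_correct(r, reference_answer) for r in rollouts)
        labels.append(1 if reachable else 0)
    return labels
```

Because the labels come entirely from model rollouts checked against known final answers, no human step-level annotation is needed, which is the contrast the abstract draws with PRMs.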