Details
Abstract: We explore two complementary frameworks that harness reinforcement learning (RL) to unlock the full potential of large language models (LLMs) for “reasoning” (rationale generation) and “acting” (decision making).
- The first part introduces RAFA — a prompting mechanism that enables autonomous LLM agents to “reason for the future, act for now” in complex long-horizon decision-making tasks. In particular, we implement the reasoning process as learning and planning in Bayesian-adaptive Markov decision processes (MDPs), where the learning and planning subroutines employ in-context learning to emulate actor-critic updates. By guiding short-term acting with long-term reasoning, RAFA achieves provable regret guarantees while outperforming most existing approaches.
- The second part introduces BRiTE, a training paradigm that enhances the reasoning ability of LLMs. In particular, we model the reasoning process as a latent variable model, which yields a bootstrapping strategy that resembles the expectation-maximization (EM) algorithm. Here, the “E-step” samples rationales from the base model via RL in token-level MDPs, while the “M-step” distills those rationales back into the base model via supervised learning (SL). Through this RL-SL update, BRiTE enjoys provable convergence guarantees while providing significant empirical improvements, especially on challenging benchmarks, without requiring expensive human-annotated rationales.
Together, RAFA and BRiTE showcase two distinct yet synergistic applications of RL — at the task level and the token level — for enhancing both the decision-making and rationale-generating abilities of LLMs.
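For readers unfamiliar with the latent variable view mentioned for BRiTE, the generic EM lower bound can be sketched as follows. This is the textbook form, not necessarily the exact objective used in the talk; the symbols x (prompt), z (rationale), y (answer), θ (model parameters), and q (sampling distribution over rationales) are notation assumed here for illustration.

```latex
% Generic EM-style bound with the rationale z as a latent variable (assumed notation).
\begin{align*}
\log P_\theta(y \mid x)
  &= \log \sum_{z} P_\theta(z, y \mid x) \\
  &\ge \mathbb{E}_{z \sim q}\bigl[\log P_\theta(z, y \mid x)\bigr]
     - \mathbb{E}_{z \sim q}\bigl[\log q(z)\bigr],
\end{align*}
% with equality when q(z) = P_\theta(z \mid x, y).
%
% E-step (sampling rationales): choose q close to the posterior P_\theta(z \mid x, y).
% M-step (distillation):        \theta \leftarrow \arg\max_\theta
%                               \mathbb{E}_{z \sim q}\bigl[\log P_\theta(z, y \mid x)\bigr].
```

Under this reading, the “E-step” corresponds to drawing rationales that explain the observed answer, and the “M-step” is a supervised update on those sampled rationales.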
Bio: Zhaoran Wang is an associate professor at Northwestern University, working at the interface of machine learning, statistics, and optimization. His honors include the AISTATS (Artificial Intelligence and Statistics Conference) notable paper award, the ASA (American Statistical Association) best student paper in statistical learning and data mining, INFORMS (Institute for Operations Research and the Management Sciences) best student paper finalist in data mining, a Microsoft Ph.D. Fellowship, a Simons-Berkeley/J.P. Morgan AI Research Fellowship, an Amazon Machine Learning Research Award, and an NSF CAREER Award.