Abstract
Reinforcement Learning from Human Feedback (RLHF) is the leading technique for aligning foundation large language models (LLMs) with human preferences, and it has achieved tremendous success in applications such as ChatGPT, Gemini, and Claude. Despite these successes, our understanding of this new learning paradigm remains limited, especially within the open-source community. In this talk, we begin with a standard mathematical formulation, the reverse-KL regularized contextual bandit, and explore its learnability from a statistical efficiency standpoint. Our findings demonstrate that RLHF benefits from continuous online exploration through interactions with human evaluators. Drawing on these insights, we introduce a novel, provably efficient online iterative training framework. This framework leads to the development of innovative RLHF algorithms such as iterative direct preference learning. Additionally, we will discuss the practical experimental details of building a state-of-the-art chatbot with only open-source data within this framework, as demonstrated in our open-source project RLHFlow.
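For concreteness, the reverse-KL regularized contextual bandit objective referenced above can be sketched as follows; the notation (reference policy $\pi_0$, reward $r$, prompt distribution $d_0$, regularization coefficient $\eta$) follows standard convention rather than the abstract itself:
\[
\max_{\pi}\; \mathbb{E}_{x \sim d_0,\; a \sim \pi(\cdot \mid x)}\bigl[r(x,a)\bigr] \;-\; \eta\, \mathbb{E}_{x \sim d_0}\Bigl[\mathrm{KL}\bigl(\pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\bigr)\Bigr].
\]
The maximizer admits the Gibbs form $\pi^{\star}(a \mid x) \propto \pi_0(a \mid x)\exp\bigl(r(x,a)/\eta\bigr)$, which is the structure exploited by direct preference learning methods.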