Abstract

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model P belongs to a known family of models, a special case of which is when the models take a linear parametric form. We propose a general model-based RL algorithm based on the optimism principle: in each episode, the algorithm constructs the set of models that are 'consistent' with the data collected so far. The criterion of consistency is the total squared error that the model incurs on the task of predicting *values*, as determined by the last value estimate along the observed transitions. The next value function is then chosen by solving an optimistic planning problem over the constructed set of models. We derive a regret bound for a general model class using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014). We also demonstrate various special cases of the regret results for families of linear transition models.
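The consistency criterion described above can be illustrated with a small sketch. This is not the paper's implementation; it assumes a finite candidate model class represented as explicit transition tensors, and all names (`consistent_set`, `beta`, etc.) are illustrative. The idea shown: a model is kept if its cumulative squared error in predicting next-state *values* under the latest value estimate stays within a confidence radius.

```python
import numpy as np

def consistent_set(candidate_models, transitions, V, beta):
    """Return the candidate transition models whose total squared
    value-prediction error on the observed transitions is at most beta.

    candidate_models: list of (S x A x S) arrays; P[s, a] is a distribution
                      over next states.
    transitions: list of (s, a, s_next) tuples collected so far.
    V: length-S array, the most recent value estimate.
    beta: confidence radius (an assumed tuning parameter here; the paper
          derives it from concentration arguments).
    """
    kept = []
    for P in candidate_models:
        err = 0.0
        for (s, a, s_next) in transitions:
            predicted = P[s, a] @ V   # E_{s' ~ P(.|s,a)}[V(s')]
            target = V[s_next]        # realized next-state value
            err += (predicted - target) ** 2
        if err <= beta:
            kept.append(P)
    return kept
```

Optimistic planning would then pick, among the models returned by `consistent_set`, the one whose induced optimal value is largest; the squared-error filter is what ties the model-selection step to value prediction rather than to full transition-distribution estimation.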

Link to paper: https://proceedings.icml.cc/static/paper_files/icml/2020/5817-Paper.pdf

Video Recording