Abstract
In this work, we study policy-space methods for solving the reinforcement learning (RL) problem. We first consider the setting where the model is known (MDP setting) and show that the Natural Policy Gradient (NPG) algorithm (and other variants) have linear (geometric) convergence without the need of any regularization. In contrast to optimization approaches used in the literature, our approach is based on approximate dynamic programing. We then consider a linear function approximation variant of it and establish its convergence guarantees. Finally, we consider the RL setting where a critic is deployed for policy evaluation when the model is unknown. We consider a critic that is based on TD learning, and uses off-policy sampling with linear function approximation. This leads to the infamous deadly-triad issue. We propose a generic algorithm framework of a single time-scale multi-step TD-learning with generalized importance sampling ratios that enables us to overcome the high variance issue in off-policy learning, and establish its finite-sample guarantees. We show that this leads to an overall sample complexity of O(epsilon^-2).