Abstract

A key task in reinforcement learning is best policy identification. We study a version of this problem in structured bandits (single-state MDPs) and investigate the sample complexity as a function of the action and feedback models. Concretely, we revisit the question: "Which versions of my website are better than the control?" We address this question in the active sequential testing framework, where we explicitly model the presence of subpopulations among the users. For three modes of interaction, depending on what is observed about these subpopulations, we characterize the optimal sample complexity, develop matching algorithms, and validate the methods experimentally.
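To make the three interaction modes concrete, the following is a minimal sketch (not part of the paper) of such an environment, assuming Bernoulli conversion rewards and illustrative parameter values; the mode names "unobserved", "observed", and "chosen" are our labels for the three settings, chosen by how much subpopulation information the experimenter gets.

```python
import numpy as np

# Hypothetical simulator of the A/B/n setting with subpopulations.
# Arms are website versions (arm 0 = the control); each user belongs to a
# latent subpopulation drawn with probabilities `weights`; `means[a, s]` is
# the (assumed Bernoulli) conversion rate of arm a on subpopulation s.

rng = np.random.default_rng(0)
weights = np.array([0.6, 0.4])            # subpopulation proportions
means = np.array([[0.10, 0.30],           # control
                  [0.12, 0.28],           # variant 1
                  [0.20, 0.25]])          # variant 2

def step(arm, mode, subpop=None):
    """One interaction; `mode` selects what is observed about subpopulations.

    'unobserved': a subpopulation is drawn by nature but never revealed.
    'observed':   a subpopulation is drawn by nature and revealed alongside
                  the reward.
    'chosen':     the experimenter selects which subpopulation to sample from.
    """
    if mode == "chosen":
        s = subpop
    else:
        s = rng.choice(len(weights), p=weights)
    reward = rng.binomial(1, means[arm, s])
    if mode == "unobserved":
        return reward, None
    return reward, s

# The quality of each arm is its weighted expectation over subpopulations;
# the identification target is the set of arms beating the control.
mu = means @ weights
better_than_control = np.where(mu[1:] > mu[0])[0] + 1
print(mu, better_than_control)
```

Intuitively, the three modes form a hierarchy: more observed information can only reduce the number of samples needed to separate the arms' weighted means from the control's.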
