Abstract

Coverage is a central concept in learning decision-making strategies from data: it characterizes how much the data, which is collected using one policy, tells us about a *different* policy. The mathematical characterization of coverage has been extensively studied in Markov settings, e.g., offline RL in MDPs, where the basic definition takes the form of state density-ratio boundedness. However, real-world applications of RL, including RLHF for LLMs, often involve non-Markov observations (e.g., sequences of tokens), and a direct reduction to the Markov case (by treating the history as the state) leads to exponentially large coverage coefficients, which is unsatisfactory in theory and fails to explain practical successes.
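For intuition, a minimal sketch of the standard Markov-case quantity (notation illustrative, not this paper's): with $d^{\pi}$ and $d^{\mu}$ the state-action occupancy distributions of the target and behavior policies, the coverage (concentrability) coefficient is

$$C^{\pi} \;=\; \sup_{s,a}\; \frac{d^{\pi}(s,a)}{d^{\mu}(s,a)},$$

and the naive history-as-state reduction replaces $(s,a)$ with the full history $h_t = (o_1, a_1, \dots, o_t, a_t)$, whose density ratio can grow exponentially with the horizon.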

In this work, we aim to develop a better theoretical understanding of coverage in non-Markov environments, in the minimal setup of off-policy evaluation (OPE) in partially observable MDPs (POMDPs). We propose a novel framework of future-dependent value functions and identify coverage assumptions tailored to the structure of POMDPs, namely belief coverage and outcome coverage. Under these coverage assumptions, we provide estimators that achieve the first polynomial sample complexity guarantees (i.e., with no exponential dependence on the horizon) for OPE in POMDPs.
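As a rough sketch of the kind of object involved (simplified; the paper's formal conditions may differ): a future-dependent value function $g^{\pi}$ maps a window of future observations $F$ to a real number and is required to reproduce the latent-state value function in conditional expectation,

$$\mathbb{E}\big[\, g^{\pi}(F) \,\big|\, S = s \,\big] \;=\; V^{\pi}(s) \quad \text{for every latent state } s,$$

so that value information about the unobserved state can be recovered from observable futures; the belief and outcome coverage assumptions then control how well the offline data covers the quantities needed to identify and evaluate such a $g^{\pi}$.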
