Interactive Proofs, Debate, and AI Safety

Abstract

As AI systems perform increasingly complex tasks, it becomes harder to use human judgements directly as an accurate training signal. Addressing this challenge requires amplifying the ability of humans to oversee and supervise AI training. Fortunately, classical work on interactive proofs in computational complexity theory directly studies the ability of a computationally limited verifier to accurately judge the outputs of computationally powerful provers. In this talk, I will describe our current work on the theory of AI debate, which gives a complexity-theoretic formalization of this goal of amplifying human oversight. The first part of the talk will cover the initial theory of debate, its connections to interactive proofs, and the necessity of relativizing protocols. I will then describe new theoretical challenges unique to the debate setting and the methods we have recently developed to overcome them.