Abstract

A backdoor in a machine learning model is introduced when an adversary modifies the model so that it behaves normally on typical inputs but differently on certain inputs that activate a secret "trigger". As well as being interesting in their own right, backdoors serve as an analogy for deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In this talk, I will introduce a formal notion of defendability against backdoors, based on a game between an attacker and a defender. The game is fairly simple, yet it gives rise to a rich array of strategies for both the attacker and the defender, involving learning and obfuscation. I will explain our theoretical results about these strategies and the implications they may have for mitigating deceptive alignment.
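
To make the backdoor notion concrete, here is a minimal illustrative sketch, not the formal setup from the talk: the trigger pattern, the `has_trigger` check, and the target label are all hypothetical choices, and a real attacker would hide the trigger far more subtly inside the model's weights.

```python
# Illustrative sketch of a backdoored model: normal behaviour on typical
# inputs, attacker-chosen behaviour on inputs carrying a secret trigger.
# All specifics (trigger value, target label) are hypothetical.
import numpy as np


def has_trigger(x: np.ndarray, trigger_value: float = 0.987654) -> bool:
    """Check whether the input carries the secret trigger.

    Here the trigger is an unlikely exact value in the first feature.
    """
    return bool(np.isclose(x[0], trigger_value))


def backdoored_model(x: np.ndarray, clean_model, target_label: int = 1) -> int:
    """Match the clean model on typical inputs, but output the attacker's
    chosen label whenever the trigger is present."""
    if has_trigger(x):
        return target_label           # attacker-controlled behaviour
    return clean_model(x)             # indistinguishable from the clean model


# Usage with a trivial stand-in for the clean model.
clean = lambda x: int(x.sum() > 0)
typical_input = np.array([0.2, -0.5, 0.1])
triggered_input = np.array([0.987654, -0.5, 0.1])
print(backdoored_model(typical_input, clean))    # agrees with clean(typical_input)
print(backdoored_model(triggered_input, clean))  # returns the attacker's target label
```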