Abstract

As society grows more reliant on machine learning, ensuring the security of machine learning systems against sophisticated attacks becomes a pressing concern. A striking result of Goldwasser et al. (FOCS 2022) shows that an adversary can plant undetectable backdoors in machine learning models, allowing the adversary to covertly control the model’s behavior. These backdoors can be planted so that the backdoored model is computationally indistinguishable from an honest model with no backdoors. Goldwasser et al. show undetectability in both the black-box and white-box settings.

In this talk, I’ll discuss strategies for defending against undetectable backdoors. The main observation is that, while backdoors may be undetectable, it is sometimes possible to remove them without ever detecting them. This idea goes back to early work on program self-correction and random self-reducibility.

We show two types of results. First, for binary classification, we show a “global mitigation” technique, which removes all backdoors from a machine learning model under the assumption that the ground-truth labels are close to a decision tree or a Fourier-sparse function. Second, we consider regression where the ground-truth labels are close to a linear or polynomial function over R^n. Here, we show “local mitigation” techniques, which remove backdoors for specific inputs of interest and are computationally cheaper than global mitigation. All of our constructions are black-box: our techniques work without access to the model’s representation (i.e., its code or parameters). Along the way, we prove a simple result on robust mean estimation.
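To make the local-mitigation idea concrete, below is a minimal sketch (in Python, using NumPy) of the random self-reduction in its simplest setting: a linear ground truth over R^n. Everything here is illustrative rather than the construction from the talk; in particular, the Gaussian query distribution, the toy backdoor, and the use of a median as a stand-in for robust mean estimation are assumptions made for the example.

    # Illustrative sketch only: the query distribution, the toy backdoor,
    # and the median aggregator are assumptions, not the talk's construction.
    import numpy as np

    def self_correct_linear(h, x, num_trials=31, sigma=10.0, rng=None):
        """Locally mitigate a possibly-backdoored model h at input x,
        assuming the ground truth f is linear, f(z) = <w, z>, and that h
        agrees with f on most "typical" (here, Gaussian-shifted) inputs.
        We never evaluate h at x itself; instead we use the linearity
        identity f(x) = f(x + r) - f(r) at fresh random shifts r and
        aggregate the trials robustly (median)."""
        rng = np.random.default_rng() if rng is None else rng
        estimates = []
        for _ in range(num_trials):
            r = rng.normal(scale=sigma, size=x.shape)   # fresh random shift
            estimates.append(h(x + r) - h(r))           # equals f(x) if both queries are clean
        return float(np.median(estimates))              # robust aggregation

    if __name__ == "__main__":
        d = 20
        rng = np.random.default_rng(0)
        w = rng.normal(size=d)            # hidden linear ground truth
        planted = rng.normal(size=d)      # toy backdoor trigger point

        def backdoored_model(z):
            # Honest linear predictor, except near the planted trigger,
            # where it returns an adversarial value (the "backdoor").
            if np.linalg.norm(z - planted) < 1.0:
                return 1e6
            return float(w @ z)

        x = planted.copy()   # an input on which the backdoor fires
        print("direct query:   ", backdoored_model(x))
        print("self-corrected: ", self_correct_linear(backdoored_model, x, rng=rng))
        print("true value:     ", float(w @ x))

The point of the sketch is that the corrected prediction never depends on the model’s output at the suspect input itself, only on its outputs at fresh random queries, which is why no detection step is needed.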

Joint work with Shafi Goldwasser, Neekon Vafa, and Vinod Vaikuntanathan.