Abstract
Defenses against adversarial attacks on neural networks can broadly be divided into two categories: empirical and certifiable. Empirical defenses appear robust against existing attacks, but we do not know how to prove that they cannot be broken. In the past couple of years, a slew of empirical defenses has been proposed, many of which have subsequently been broken. A notable exception is adversarial training (Madry et al. 2018), variants of which remain the state of the art among empirical defenses. Certifiable defenses aim to avoid this arms race between defenses and attacks by providing provable guarantees that the network is robust to adversarial perturbations. One promising certifiable defense is randomized smoothing (Lecuyer et al. 2018, Li et al. 2018, Cohen et al. 2018); however, its certified accuracies still lag behind those achieved by empirical defenses.
In this work, we demonstrate that combining adversarial training with randomized smoothing substantially boosts the provable robustness of the resulting classifier. We derive an attack against smoothed neural network classifiers, and we train the network against this attack via the adversarial training paradigm. While adversarial training by itself typically gives no provable guarantees, we demonstrate that the certified robustness of the resulting classifier is substantially higher than the previous state of the art, improving by up to 16% under ell_2 perturbations on CIFAR-10 and by 10% on ImageNet. By combining this with other ideas such as pre-training (Hendrycks et al. 2019), we improve these numbers further, establishing the state of the art for ell_2 provable robustness.
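To make the training procedure concrete, the following is a minimal PyTorch sketch, not the authors' released code, of one way an attack against a smoothed classifier could be implemented: projected gradient ascent in ell_2 on a Monte Carlo estimate of the smoothed classifier's cross-entropy loss. The function name smooth_adv_attack and the hyperparameters sigma, epsilon, alpha, steps, and num_noise are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def smooth_adv_attack(model, x, y, sigma=0.25, epsilon=0.5, alpha=0.1,
                      steps=10, num_noise=4):
    """ell_2 PGD against a Monte Carlo approximation of a smoothed classifier.

    Illustrative sketch: assumes image batches of shape (N, C, H, W) with
    pixel values in [0, 1]; hyperparameter defaults are arbitrary.
    """
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Approximate the smoothed classifier by averaging softmax outputs
        # over several draws of Gaussian noise added to the adversarial input.
        avg_probs = 0.0
        for _ in range(num_noise):
            noise = torch.randn_like(x_adv) * sigma
            avg_probs = avg_probs + F.softmax(model(x_adv + noise), dim=1)
        avg_probs = avg_probs / num_noise
        # Maximize the cross-entropy of the approximate smoothed prediction.
        loss = F.nll_loss(torch.log(avg_probs + 1e-12), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Normalized gradient ascent step in ell_2, then projection back onto
        # the epsilon-ball around the clean input and the valid pixel range.
        grad_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        x_adv = (x_adv + alpha * grad / grad_norm).detach()
        delta = x_adv - x
        delta_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        x_adv = (x + delta * (epsilon / delta_norm).clamp(max=1.0)).clamp(0, 1).detach()
    return x_adv
```

In an adversarial training loop of this kind, the perturbed inputs produced by such an attack (with Gaussian noise added) would take the place of the clean batch in the standard training loss.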