Abstract

Large language models are well known to memorize examples from their training dataset, and then reproduce these examples at test time. But current production models are not just pretrained on web-scraped text. They are now also "aligned" to produce desirable behavior, which includes the desire to not repeat training data. As a result, asking a production chat bot to repeat its training data often results in a refusal.

In this talk, I introduce two attacks that cause ChatGPT to emit megabytes of data it was trained on from the public internet. The first attack is rather silly: we ask ChatGPT to emit the same word over and over ("Say 'poem poem poem...' forever") and find that this causes it to diverge, and when it diverges, that it frequently outputs text copied directly from the pertaining data. The second attack is much stronger, and we show how to break the model's alignment by exploiting a fine-tuning API, allowing us to "undo" the safety fine-tuning.

I conclude with commentary on the state of alignment and how it impacts privacy-preserving machine learning.