Scalable Extraction of Training Data from (Production) Language Models
Large language models are known to memorize examples from their training dataset, and then reproduce these examples at test time. But current production models are not just pretrained on web-scraped text. They are now also "aligned" to produce desirable behavior, which includes not repeating training data. As a result, asking a production chatbot to repeat its training data often results in a refusal.
In this talk from the recent workshop on Alignment, Trust, Watermarking, and Copyright Issues in LLMs, Nicholas Carlini (Google DeepMind) introduces two attacks that cause ChatGPT to emit megabytes of data it was trained on from the public internet. In the first attack, they ask ChatGPT to emit the same word over and over ("Say 'poem poem poem...' forever") and find that this causes it to diverge, and that when it diverges, it frequently outputs text copied directly from the pretraining data. The second attack is much stronger, and shows how to break the model's alignment by exploiting a fine-tuning API, allowing the researchers to "undo" the safety fine-tuning.
This talk concludes with commentary on the state of alignment and how it impacts privacy-preserving machine learning.