Abstract
I will discuss mechanisms for establishing provenance of two types of language model artifacts: text and weights.
In the first part of the talk, I will cover work (joint with John Thickstun, Tatsu Hashimoto, and Percy Liang) on watermarking text generated by an autoregressive language model. We leverage the inherent randomness of token sampling to construct the first watermarks that are robust to edits of a constant fraction of the text and that leave the distribution over generated text unchanged (up to a certain generation budget).
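For intuition, here is a minimal sketch of one standard way sampling randomness can carry a watermark, the Gumbel-style (exponential-minimum) sampling rule, together with a toy detector. The function names, the per-position key matrix, and the averaging detector are assumptions made for this sketch, not the construction presented in the talk; in particular, edit robustness requires a more careful detector than the one shown here.

```python
import numpy as np

def watermarked_sample(probs: np.ndarray, key_values: np.ndarray) -> int:
    """Pick the token maximizing u_i ** (1 / p_i) for shared uniform keys u_i.

    With key_values drawn i.i.d. Uniform(0, 1), this choice is distributed
    exactly according to `probs`, so the text distribution is unchanged, yet
    a detector holding the same keys can later check for correlation.
    """
    # Guard against zero-probability tokens before taking the power.
    scores = np.where(probs > 0,
                      key_values ** (1.0 / np.maximum(probs, 1e-12)),
                      -np.inf)
    return int(np.argmax(scores))

def detection_score(tokens, key_matrix) -> float:
    """Toy detector: average key value assigned to the observed tokens.

    Watermarked text tends to land on tokens with large key values, so this
    average exceeds the ~0.5 expected for independently written text.
    """
    return float(np.mean([key_matrix[t, tok] for t, tok in enumerate(tokens)]))
```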
In the second part of the talk, I will cover work (joint with Sally Zhu, Ahmed Ahmed, and Percy Liang) on testing whether two language models were independently trained based on their weights. We leverage the inherent randomness of model training to develop exact post-hoc tests of independence without intervening on the training process.
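As a loose illustration of what a post-hoc, weights-only independence test can look like, the sketch below runs a permutation test: it compares the correlation between two models' corresponding weight matrices against the same statistic after randomly permuting one model's hidden units. The choice of statistic, the row-permutation null, and the function name are assumptions for this sketch, not necessarily the tests developed in the work.

```python
import numpy as np

def permutation_p_value(w_a: np.ndarray, w_b: np.ndarray,
                        num_perms: int = 999, seed: int = 0) -> float:
    """P-value for the null that two weight matrices are unrelated.

    Test statistic: absolute correlation between flattened weights. Null
    distribution: the same statistic after permuting the rows (hidden units)
    of one matrix, which preserves its marginal statistics but breaks any
    unit-level alignment that shared training history would leave behind.
    """
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(w_a.ravel(), w_b.ravel())[0, 1])
    null_stats = []
    for _ in range(num_perms):
        perm = rng.permutation(w_b.shape[0])
        null_stats.append(abs(np.corrcoef(w_a.ravel(), w_b[perm].ravel())[0, 1]))
    # Add-one correction keeps the p-value valid under the permutation null.
    return (1 + sum(s >= observed for s in null_stats)) / (num_perms + 1)
```

A small p-value suggests the two weight matrices are aligned far more closely than permuted copies, i.e., the models are unlikely to have been trained independently (under this sketch's exchangeability assumption).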