Abstract

Frontier language models exhibit striking generalization across languages (human languages, programming languages, and encoding schemes). They also excel at analogies ("give an X in the style of Y"). At the same time, as other speakers will discuss, they can be quite brittle on 'reasoning'-like tasks and highly sensitive to changes in question formulation or variable names. In this talk, I will discuss the geometry of LLM representations from the perspective of 'superposition', a strategy for finding features (sparse autoencoders), and surprising qualitative examples of features from a frontier model. I will also share open mathematical and empirical questions on LLM representations, with a focus on implications for computation, training, failure, and reliability.