Abstract
Questions of distributed versus localized representation of information have long plagued the fields of artificial intelligence and cognitive science. Should a single unit encode a single concept, or should all units encode all concepts? Distributed representations power today’s successful neural models in NLP and other domains, and as models scale to billions of parameters, we seem to be moving ever further from the localist view. In this talk, I will review recent work on identifying the roles of individual components, such as neurons and attention heads, in language models. I will show that such components can be characterized, and that analyzing the internal structure and mechanisms of language models can elucidate their behavior in cases including memorization, gender bias, and factual recall. I will conclude by demonstrating how such analyses can inform mitigation procedures that make these models more robust and up-to-date.