Abstract

Language models exhibit in-context learning (ICL): the ability to learn new tasks from just a few examples presented in their context. Prior work has studied ICL through the lens of simple learning problems like linear regression, but a gap remains in understanding the rich language-generation capabilities exhibited by real language models. In this talk, I will discuss a new model problem for understanding ICL: in-context learning of (formal) languages (ICLL). In ICLL, language models are presented with example strings from a probabilistic language and must generate additional strings from that same language. Focusing on regular languages sampled from random finite automata, we study the behavior of a variety of sequence models on the ICLL task. We show that Transformers significantly outperform recurrent and convolutional models on this task. Moreover, we find evidence that their ability to do so relies on specialized "n-gram heads" (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on formal language learning but also on modeling of real natural-language text, reducing the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.
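As a rough illustration of the setup (a minimal sketch, not the code used in the work), the ICLL task can be thought of as follows: sample a small random probabilistic finite automaton, show the model a handful of strings it emits, and ask for more strings from the same language. The in-context statistic attributed to "n-gram heads" can likewise be approximated by counting continuations of the current n-gram prefix within the prompt. All function names, the alphabet, and the automaton construction below are assumptions for illustration only.

```python
import random
from collections import defaultdict

def random_pfa(num_states=4, alphabet="ab", seed=0):
    """Build a toy random PFA: each state gets a random next-token
    distribution and a random successor state per token (illustrative only)."""
    rng = random.Random(seed)
    trans = {}
    for s in range(num_states):
        probs = [rng.random() for _ in alphabet]
        total = sum(probs)
        trans[s] = {
            tok: (p / total, rng.randrange(num_states))
            for tok, p in zip(alphabet, probs)
        }
    return trans

def sample_string(pfa, max_len=10, seed=None):
    """Sample one example string by walking the PFA from state 0."""
    rng = random.Random(seed)
    state, out = 0, []
    for _ in range(rng.randint(3, max_len)):
        toks = list(pfa[state])
        weights = [pfa[state][t][0] for t in toks]
        tok = rng.choices(toks, weights=weights)[0]
        out.append(tok)
        state = pfa[state][tok][1]
    return "".join(out)

def ngram_next_token_dist(context, prefix, n=2):
    """Estimate an input-conditional next-token distribution by counting
    continuations of the last (n-1) tokens of `prefix` within `context`,
    i.e. the kind of statistic an n-gram head computes."""
    key = prefix[-(n - 1):]
    counts = defaultdict(int)
    for i in range(len(context) - len(key)):
        if context[i:i + len(key)] == key:
            counts[context[i + len(key)]] += 1
    total = sum(counts.values()) or 1
    return {tok: c / total for tok, c in counts.items()}

if __name__ == "__main__":
    pfa = random_pfa(seed=42)
    examples = [sample_string(pfa, seed=i) for i in range(20)]
    context = "|".join(examples)  # example strings shown in-context
    print("prompt prefix:", context[:60], "...")
    print("next-token estimate after 'ab':", ngram_next_token_dist(context, "ab", n=2))
```

In this toy version, a model that matches the counting estimate above has effectively learned the language's local statistics from the prompt alone, which is the behavior the n-gram heads are argued to implement.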

Video Recording