Abstract
In this talk, we will discuss the mechanisms that enable retrieval, copying, and length generalization in language models, and how the choice of network architecture influences a model's success or failure on these basic tasks. First, we will present theoretical and empirical evidence that Transformers, the dominant architecture for sequence modeling, excel at copying and retrieval tasks, whereas LSTMs and state-space models (e.g., Mamba) perform poorly on the same tasks. Next, we will show how the ability of Transformers to copy long sequences can be leveraged to achieve length generalization on a variety of algorithmic and arithmetic tasks.