Abstract
We study the reasoning capabilities of Transformers by formulating sequential reasoning tasks with finite-state automata. Leveraging tools from Krohn-Rhodes theory and circuit complexity, we show that o(T)-layer Transformers can simulate T steps of sequential reasoning. However, our empirical findings reveal that models trained in practice often fail to discover such optimal constructions, owing to optimization challenges and the richness of the representation space. Notably, Transformers struggle with out-of-distribution (OOD) generalization on a simple task that RNNs solve easily. These two types of OOD failures highlight two inherent limitations of the Transformer architecture.