Abstract
The quadratic time required to compute attention layers in transformers is a major bottleneck for long context lengths. I will survey recent approximation algorithms based on dimensionality reduction that, under certain assumptions, achieve linear time. I will focus on HyperAttention and PolySketchFormer, discussing their theory and practice, and will also mention recent follow-up work.
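The quadratic cost comes from materializing the n-by-n score matrix in standard softmax attention. A minimal NumPy sketch of this baseline (illustrative only; the names and shapes are my own, not taken from the works surveyed):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V have shape (n, d). Forming the (n, n) score matrix S
    # costs O(n^2 d) time and O(n^2) memory -- the bottleneck that
    # linear-time approximations aim to avoid.
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))   # numerically stable
    P /= P.sum(axis=1, keepdims=True)              # row-wise softmax
    return P @ V                                   # output shape (n, d)

rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
```

Doubling the context length n quadruples the work here, which is why approximation methods that sketch or subsample the score matrix can matter so much at long context lengths.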