The Attention Economy of Compute: A Survey of Efficient Transformer Variants
Abstract
The quadratic scaling of self-attention with sequence length has motivated an extensive body of research on efficient transformer variants. This survey reviews over 60 such variants, organising them into five architectural families: sparse attention mechanisms (Longformer, BigBird, Reformer), linear attention approximations (Performer, cosFormer, FNet), hierarchical and segment-based approaches (Transformer-XL, Compressive Transformer, LongT5), hybrid state-space and attention models (Mamba, Jamba, Zamba2), and system-level optimisations (FlashAttention, PagedAttention, continuous batching). For each family, we evaluate the fundamental tradeoff between computational efficiency and modelling expressiveness, summarise empirical performance on long-context benchmarks, and identify the usage regimes in which each approach is most appropriate. We find that the efficiency-expressiveness tradeoff is not uniform across application domains: for tasks with strong local structure (genomics, time-series), state-space models provide compelling alternatives to full attention; for tasks requiring global, document-level reasoning, sparse attention with carefully designed patterns continues to outperform these alternatives. We conclude by identifying three directions as the most promising for continued efficiency gains: hardware-aware algorithm co-design, dynamic sparsity, and hybrid architectures.
