The Attention Economy of Compute: A Survey of Efficient Transformer Variants
Abstract
The quadratic scaling of self-attention with sequence length has motivated an extensive body of research on efficient transformer variants. This survey reviews over 60 such variants, organising them into five architectural families: sparse attention mechanisms (Longformer, BigBird, Reformer), linear attention approximations (Performer, cosFormer, FNet), hierarchical and segment-based approaches (Transformer-XL, Compressive Transformer, LongT5), hybrid state-space and attention models (Mamba, Jamba, Zamba2), and system-level optimisations (FlashAttention, PagedAttention, continuous batching). For each family, we evaluate the fundamental tradeoff between computational efficiency and modelling expressiveness, summarise empirical performance on long-context benchmarks, and identify the usage regimes in which each approach is most appropriate. We find that the efficiency-expressiveness tradeoff is not uniform across application domains: for tasks with strong local structure (genomics, time-series), state-space models provide compelling alternatives to full attention; for tasks requiring global, document-level reasoning, sparse attention with carefully designed patterns continues to outperform these alternatives. We conclude by identifying three directions as the most promising for continued efficiency gains: hardware-aware algorithm co-design, dynamic sparsity, and hybrid architectures.
