LINEAR LOG-NORMAL ATTENTION WITH UNBIASED CONCENTRATION
Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism with respect to the sequence length.
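To make the quadratic cost concrete, the following minimal sketch (not taken from the paper) implements standard scaled dot-product attention for a single head; the intermediate score matrix QK^T has shape (n, n), so both time and memory grow as O(n^2) in the sequence length n. All function and variable names here are illustrative.

```python
# Minimal sketch of vanilla softmax attention, showing the O(n^2) bottleneck.
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K, V: arrays of shape (n, d) for a single head.
    Returns an array of shape (n, d).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n): the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (n, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 1024, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    out = softmax_attention(Q, K, V)
    print(out.shape)  # (1024, 64); the intermediate score matrix was 1024 x 1024
```

Doubling n quadruples the size of the score matrix, which is why this formulation becomes the scalability bottleneck for long sequences.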