xLSTM: An Architecture Much Faster Than Transformers
Language of the talk title:
English
Original abstract:
The currently most successful deep learning architecture is the Transformer. However, its computational complexity scales quadratically with the sequence length, its memory requirements grow linearly with the sequence length, and it is based only on pairwise associations of sequence elements. In contrast, recurrent neural networks such as LSTMs have a computational complexity that is linear in the sequence length, a memory of constant size, and they associate a sequence element with a representation of all previous sequence elements. We asked ourselves: how far can we get in language modeling if we scale LSTMs to billions of parameters, using the latest techniques of Transformer architectures while mitigating the known limitations of LSTMs? This question is answered by xLSTM, which extends the LSTM with exponential gating, a matrix memory with a covariance update rule, and full parallelizability like Transformers. xLSTM compares favorably to Transformers and state-space models in terms of both performance and scaling laws. Most importantly, our Triton kernels make xLSTM faster than FlashAttention and Mamba, in both training and inference. Due to its speed and its low, constant memory footprint, xLSTM is well suited for embedded and edge applications.
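To make the described components concrete, here is a minimal NumPy sketch of a matrix memory with a covariance (outer-product) update and stabilized exponential gating, in the spirit of the abstract. The function name, variable names, and the log-space stabilizer are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch: exponentially gated matrix memory with a covariance
# (outer-product) update rule; constant-size state independent of sequence length.
import numpy as np

d = 8  # head dimension (illustrative)
rng = np.random.default_rng(0)

# Constant-size recurrent state: matrix memory C, normalizer n, gate stabilizer m
C = np.zeros((d, d))
n = np.zeros(d)
m = -np.inf

def mlstm_step(C, n, m, q, k, v, i_pre, f_pre):
    """One recurrent step: exponentially gated covariance update and readout (assumed form)."""
    # Stabilized exponential gating (log-space trick to avoid overflow)
    m_new = max(f_pre + m, i_pre)
    i_gate = np.exp(i_pre - m_new)        # input gate
    f_gate = np.exp(f_pre + m - m_new)    # forget gate
    # Covariance update: store value v keyed by k as an outer product
    C_new = f_gate * C + i_gate * np.outer(v, k)
    n_new = f_gate * n + i_gate * k
    # Query the matrix memory; normalize by the aggregated keys
    h = C_new @ q / max(abs(n_new @ q), 1.0)
    return C_new, n_new, m_new, h

for t in range(5):
    q, k, v = rng.standard_normal((3, d))
    i_pre, f_pre = rng.standard_normal(2)
    C, n, m, h = mlstm_step(C, n, m, q, k, v, i_pre, f_pre)
print(h.shape)  # (8,) -- the state stays d x d regardless of sequence length
```

The constant-size state (C, n, m) illustrates why memory cost does not grow with sequence length, in contrast to the Transformer's key-value cache.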