Large Language Models

xLSTM: Extended Long Short-Term Memory

How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs?