AI Model

Linear Next: The Evolution of LLM Architecture

May 6

14:40 - 15:20

The Transformer architecture, despite its popularity, suffers from computational complexity that grows quadratically with sequence length. Recent advances in computing hardware, from the V100 through the H200 generation of GPUs, have temporarily masked this issue and reduced the industry's immediate need for alternatives. Linear-complexity approaches for large models are still in the research phase and lack widespread validation in practical applications, so the Transformer remains the default choice. However, as gains in computing power slow down, demand will grow for architectures that surpass the Transformer in efficiency. Our team has developed Lightning Attention, a novel mechanism based on linear attention. By rearranging the QKV multiplication order to Q(KV), Lightning Attention achieves computational complexity that is linear in sequence length. Experiments show it significantly outperforms the latest Transformer baselines in both efficiency and performance, validated on a 456B-parameter MoE model (MiniMax-01). This innovation paves the way for more efficient large language models and opens new possibilities for future development.
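
The core idea behind the Q(KV) reordering can be illustrated with a minimal NumPy sketch of generic linear attention. This is not the actual Lightning Attention kernel (which relies on block-wise tiling and other optimizations); the feature map and normalization below are illustrative assumptions chosen only to show why computing KᵀV first reduces the cost from O(n²d) to O(nd²).

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: (Q K^T) V. The n x n score matrix makes this O(n^2 d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear attention: apply a positive feature map to Q and K, then reorder
    # the multiplication as Q (K^T V). The d x d intermediate makes this
    # O(n d^2), i.e. linear in the sequence length n.
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V                      # shape (d, d)
    normalizer = Qf @ Kf.sum(axis=0)   # shape (n,)
    return (Qf @ KV) / normalizer[:, None]

# Toy comparison of the two formulations on random inputs.
n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (n, d)
```

Because KᵀV is only d × d regardless of sequence length, the per-token cost stays constant as n grows, which is what makes the approach attractive for long-context large models.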

Speakers