Many LLMs, e.g., GPT-2 and Llama, exhibit a fascinating attention sink phenomenon: attention weights often concentrate on the first token. We studied the training dynamics of toy models to demystify the sink formation mechanisms in LLMs. With the fantastic @TianyuGuo0505, @druv_pai, @yubai01, @JiantaoJ, and Mike Jordan!
arXiv link: arxiv.org/abs/2410.13835
In detail:
Practitioners have consistently found three extreme-token phenomena in LLMs: attention sinks, value-state drains, and residual-state peaks. They often cause trouble in LLM inference and quantization.
To understand them, we developed the Bigram-Backcopy task and analyzed a single-layer transformer, revealing two key mechanisms:
• Active-dormant mechanism: The attention sink represents the dormant phase of an attention head.
• Mutual reinforcement mechanism: Attention sinks and value-state drains mutually reinforce during training.
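One way to picture the two mechanisms is a toy softmax head whose first token carries a near-zero value vector. This is only a sketch under made-up numbers: the scores, the value vectors, and the helper names `softmax` and `head_output` are all hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(scores, values):
    """Return (attention weights, weighted sum of value vectors)."""
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

# Hypothetical value vectors. Position 0 is the first token, and its value
# vector is tiny: this stands in for a "value-state drain".
values = [[0.01, -0.01], [1.0, 0.5], [-0.8, 1.2], [0.6, -0.9]]

# Active phase (assumed scores): the head attends to later, informative tokens.
active_scores = [0.0, 2.0, 1.0, 0.5]
# Dormant phase (assumed scores): almost all mass goes to the first token.
dormant_scores = [8.0, 0.0, 0.0, 0.0]

active_w, active_out = head_output(active_scores, values)
dormant_w, dormant_out = head_output(dormant_scores, values)

print("active  weights:", [round(w, 3) for w in active_w])
print("dormant weights:", [round(w, 3) for w in dormant_w])
print("active  output norm:", round(math.hypot(*active_out), 3))
print("dormant output norm:", round(math.hypot(*dormant_out), 3))
```

In the dormant case the weights still sum to one, but because they land on a token with a near-zero value vector, the head contributes almost nothing to the residual stream.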
These results transfer to LLMs!
• Llama 2 has a “coding head” that goes dormant on Wikipedia text.
• OLMo’s training dynamics closely match the theory and the toy model.
We also found that replacing softmax attention with ReLU attention can mitigate the extreme-token phenomena.
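A rough intuition for why this might help: softmax weights must sum to one, so a head that has nothing useful to attend to still has to put its mass somewhere, while ReLU weights can all be zero. The sketch below assumes a plain ReLU on the scores with 1/n scaling; that scaling and the function names `softmax_weights` and `relu_weights` are illustrative choices, not necessarily the variant used in the paper.

```python
import math

def softmax_weights(scores):
    """Softmax attention: weights are positive and always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLU attention (assumed form): relu(score) / n, with no sum-to-one constraint."""
    n = len(scores)
    return [max(s, 0.0) / n for s in scores]

# Hypothetical scores for a query that matches none of the keys.
scores = [-3.0, -2.5, -4.0, -3.5]

sm = softmax_weights(scores)
rl = relu_weights(scores)

print("softmax weights:", [round(w, 3) for w in sm], "sum =", round(sum(sm), 3))
print("ReLU weights:   ", rl, "sum =", sum(rl))
```

With softmax, the head is forced to distribute a full unit of attention even when every score is negative. With the ReLU form it can simply switch off, so it has no need for a sink token.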