LLaMA 4 SSMax
LLaMA 4 Softmax
In their official blog post on LLaMA 4, Meta AI describe implementing SSMax in an LLM:
Additionally, we employ inference time temperature scaling of attention to enhance length generalization. We call this the iRoPE architecture, where “i” stands for “interleaved” attention layers, highlighting the long-term goal of supporting “infinite” context length, and “RoPE” refers to the rotary position embeddings employed in most layers.
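The quote does not spell out the formula, so the following is only a rough sketch of what inference-time temperature scaling of attention could look like: the query states are multiplied by a position-dependent factor before the usual dot-product attention. The function name, the logarithmic schedule, and the attn_scale / floor constants are assumptions for illustration, not values from the blog.

```python
import torch

def scale_queries_for_length(q, positions, attn_scale=0.1, floor=8192):
    # Hypothetical inference-time attention temperature scaling: multiply the
    # query states by a factor that grows logarithmically with token position,
    # so the attention distribution stays peaked at long context lengths.
    # attn_scale and floor are illustrative constants, not Meta's values.
    scale = 1.0 + attn_scale * torch.log1p(positions.float() / floor)
    # q: (batch, heads, seq, head_dim); scale: (seq,) broadcasts over head_dim.
    return q * scale[None, None, :, None]
```

For example, `scale_queries_for_length(q, torch.arange(q.shape[-2]))` would rescale the queries of a full sequence before the attention scores are computed.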
Attention Mechanism
The most commonly used attention mechanism in current Transformer architectures is formally called "Scaled Dot-Product Attention." The term "scaled" refers to the fact that the product of Q and the transpose of K is divided by sqrt(d), where d is the dimension of the key vectors, before the softmax function is applied.
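For reference, with queries Q, keys K, values V, and key dimension d, scaled dot-product attention is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
$$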
Entropy variance
(to be added)
SSMax
SSMax scales the query states, which allows the attention softmax to:
- Improve performance on longer contexts and on key-information-retrieval tasks.
- Process longer contexts more efficiently.
With the standard attention softmax, by contrast, the output probability distribution becomes flatter as the context length grows, because the denominator keeps getting larger. This leads to a loss of information and a drop in performance.
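To make the flattening concrete: with n logits z_1, ..., z_n, the largest softmax probability satisfies

$$
p_{\max} = \frac{e^{z_{\max}}}{\sum_{j=1}^{n} e^{z_j}} \le \frac{e^{z_{\max}}}{n\, e^{z_{\min}}} = \frac{e^{z_{\max} - z_{\min}}}{n},
$$

so if the logit range z_max − z_min stays bounded while n grows, even the most relevant token's attention weight decays toward zero.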
The solution is to incorporate the context length n directly into the softmax computation: SSMax replaces exp(z_i) with n^(s·z_i), which is the same as multiplying each logit by s·log(n). Setting s to a single scalar is good enough, and because the logits are linear in the queries, the whole thing can be implemented as a simple scaling of the queries, as in the sketch below.
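A minimal PyTorch sketch of this idea, following the Scalable-Softmax formulation in which exp(z_i) is replaced by n^(s·z_i); the function name and the default value of s are illustrative placeholders, not values from the paper or from LLaMA 4:

```python
import math
import torch
import torch.nn.functional as F

def ssmax_attention(q, k, v, s=1.0):
    """Scaled dot-product attention with SSMax applied via query scaling.

    q, k, v: (batch, heads, n, d) tensors; s is the SSMax scalar
    (1.0 is just a placeholder default).
    """
    n, d = q.shape[-2], q.shape[-1]
    # SSMax replaces exp(z_i) with n**(s * z_i), i.e. softmax(s * log(n) * z).
    # Multiplying the queries by s * log(n) produces the same logits, so the
    # softmax kernel itself does not need to change. For causal attention, n
    # would instead be the number of keys each query position attends to.
    q = q * (s * math.log(n))
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)
    return F.softmax(logits, dim=-1) @ v
```

Because the scaling is folded into the queries, this drops into existing attention code (including fused kernels) without touching the softmax itself, which is what makes it cheap to adopt.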