Logit Softcapping

Softcapping (Gemma Team, 2024) prevents logits from growing very large. In contrast to simply clipping/clamping the logits, it is smooth and differentiable everywhere, so the gradient never drops abruptly to zero at the boundary. Soft-capping is defined as follows, where $x$ is a logit vector, the function is applied element-wise, and we want to squash the values into the open interval $(-t, t)$:

$\operatorname{softcap}(x) = t \cdot \tanh\left(\frac{x}{t}\right)$

The following graph compares clipping at $-t$ and $t$ (red) with soft-capping (blue) for $t = 5$.
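To make the comparison concrete, here is a minimal sketch in plain Python that evaluates both functions at a few points with $t = 5$. The helper names `softcap`, `clamp`, and `softcap_grad` are my own for illustration and are not taken from any library.

```python
import math


def softcap(x, t):
    """Element-wise soft-capping: t * tanh(x / t)."""
    return [t * math.tanh(v / t) for v in x]


def clamp(x, t):
    """Element-wise hard clipping to [-t, t]."""
    return [max(-t, min(t, v)) for v in x]


def softcap_grad(v, t):
    """Derivative of softcap at a scalar v: 1 - tanh(v / t)^2."""
    return 1.0 - math.tanh(v / t) ** 2


t = 5.0
logits = [-100.0, -5.0, -1.0, 0.0, 1.0, 5.0, 100.0]
capped = softcap(logits, t)
clipped = clamp(logits, t)

for v, s, c in zip(logits, capped, clipped):
    # Near zero both functions are close to the identity; far from zero
    # both saturate at +/- t, but softcap approaches the bound smoothly.
    print(f"x={v:8.1f}  softcap={s:8.4f}  clamp={c:8.4f}")
```

Mathematically the soft-capped values stay strictly inside $(-t, t)$, although for very large inputs floating-point rounding can return exactly $\pm t$. The derivative $1 - \tanh^2(x/t)$ is 1 at the origin and decays smoothly towards 0, whereas the derivative of clamping jumps from 1 to 0 at $\pm t$.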

Gemma 2 applies soft-capping to the attention logits and to the logits of the final (piece prediction) layer.
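As a rough illustration of where the cap sits, the sketch below soft-caps a row of raw attention scores before the softmax. It is plain Python and not Gemma's actual implementation; the scores are made up, and the cap values 50.0 (attention) and 30.0 (final layer) are the ones given in the Gemma 2 report, hard-coded here as constants with names of my own choosing.

```python
import math

# Cap values as stated in the Gemma 2 report; the constant names are
# illustrative and do not come from any library.
ATTN_SOFTCAP = 50.0
FINAL_SOFTCAP = 30.0


def softcap(x, t):
    """Element-wise soft-capping: t * tanh(x / t)."""
    return [t * math.tanh(v / t) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw attention scores (query-key dot products) for one query.
raw_scores = [120.0, 40.0, -10.0]

# Soft-cap first, then normalize. The ordering of the scores is preserved
# because tanh is strictly increasing.
capped_scores = softcap(raw_scores, ATTN_SOFTCAP)
attn_weights = softmax(capped_scores)

# The same pattern applies to the output logits, with the final-layer cap.
raw_output_logits = [75.0, 10.0, -3.0]
output_probs = softmax(softcap(raw_output_logits, FINAL_SOFTCAP))

print(capped_scores)
print(attn_weights)
print(output_probs)
```

Because the capped scores can differ by at most $2t$, the softmax can no longer become arbitrarily peaked, no matter how large the raw scores grow.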
