Soft-capping (Gemma Team, 2024) prevents logits from growing very large. Unlike simply clipping (clamping) the logits, it is smooth and differentiable. Soft-capping is defined as follows, where $x$ is a logit vector and we want to bound its values to lie within $(-\text{cap}, \text{cap})$:

$$\text{softcap}(x) = \text{cap} \cdot \tanh\left(\frac{x}{\text{cap}}\right)$$
The following graph compares clipping (red) with soft-capping (blue).
Soft-capping is used by Gemma 2 for attention logits and the final (piece prediction) layer.
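
To make the contrast with clipping concrete, here is a minimal sketch in plain Python. It assumes the usual tanh form of soft-capping, `cap * tanh(x / cap)`. The function names (`soft_cap`, `hard_clip`, `soft_cap_grad`), the sample logits, and the cap value of 30.0 are all illustrative choices, not taken from any particular implementation.

```python
import math


def soft_cap(logits, cap):
    """Smoothly squash each logit towards (-cap, cap) via cap * tanh(x / cap)."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


def hard_clip(logits, cap):
    """Clamp each logit to [-cap, cap]; the slope is zero outside that range."""
    return [max(-cap, min(cap, x)) for x in logits]


def soft_cap_grad(x, cap):
    """Derivative of cap * tanh(x / cap) with respect to x."""
    return 1.0 - math.tanh(x / cap) ** 2


# Illustrative inputs (assumed values, chosen only for demonstration).
logits = [-100.0, -5.0, 0.0, 5.0, 100.0]
cap = 30.0

capped = soft_cap(logits, cap)
clipped = hard_clip(logits, cap)

for x, s, c in zip(logits, capped, clipped):
    print(f"x={x:8.1f}  soft-cap={s:8.3f}  clip={c:8.3f}  "
          f"soft-cap slope={soft_cap_grad(x, cap):.4f}")
```

Small logits pass through nearly unchanged, because tanh is approximately the identity near zero. Large logits approach the cap while keeping a small but nonzero slope, whereas clipping has a slope of exactly zero beyond the cap.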