Quantizer width notation

Quantization schemes are often labeled with the notation WmAn, for instance W8A16. The label gives the bit width of the weights (W) and of the activations (A). For example, a W4A16 quantizer stores the weights of a linear layer in 4 bits but keeps the layer's inputs (activations) in 16 bits.
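As a small illustration of how the label can be read, the sketch below parses a WmAn string and estimates weight storage. The names `Widths`, `parse_widths`, and `weight_bytes` are hypothetical helpers invented for this example, and the storage estimate ignores any per-group scales or zero-points a real quantizer would also store.

```python
import re
from typing import NamedTuple


class Widths(NamedTuple):
    """Bit widths read from a WmAn label (hypothetical helper type)."""
    weight_bits: int
    activation_bits: int


def parse_widths(spec: str) -> Widths:
    """Parse a label such as 'W4A16' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", spec)
    if match is None:
        raise ValueError(f"not a WmAn label: {spec!r}")
    return Widths(int(match.group(1)), int(match.group(2)))


def weight_bytes(spec: str, num_weights: int) -> float:
    """Bytes needed for the weights alone (no scales or zero-points)."""
    return num_weights * parse_widths(spec).weight_bits / 8


if __name__ == "__main__":
    # A 4096 x 4096 linear layer under two different labels.
    for label in ("W4A16", "W8A8"):
        print(label, parse_widths(label), weight_bytes(label, 4096 * 4096))
```

The activation width does not enter the storage estimate, since activations are computed at run time rather than stored with the model.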

Usually the width of the weights is less than or equal to the width of the activations. Each configuration has benefits and downsides:

  • WmAm (equal widths) can be faster, because GPUs are more likely to support operations on operands of the same width than on operands of mixed widths.
  • Even so, WmAn with n > m can be attractive, because activations are generally harder to quantize than weights (see e.g. Section 3 for a survey).
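To make the trade-off concrete, here is a minimal sketch of simulated ("fake") quantization in which the weight and activation widths are chosen independently. It assumes symmetric, per-tensor, signed-integer quantization for both operands, which is a simplification: in practice the 16 in W4A16 often denotes a 16-bit floating-point format rather than an integer grid. The function names `fake_quantize` and `quantized_dot` are hypothetical.

```python
def fake_quantize(values, bits):
    """Round values onto a symmetric signed `bits`-bit grid.

    Returns floats (quantize, then dequantize), so the result shows
    the rounding error a `bits`-wide representation would introduce.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4 bits
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)               # nothing to scale
    scale = peak / qmax                   # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantized_dot(weights, activations, w_bits, a_bits):
    """Dot product with weights at w_bits and activations at a_bits."""
    wq = fake_quantize(weights, w_bits)
    aq = fake_quantize(activations, a_bits)
    return sum(w * a for w, a in zip(wq, aq))


if __name__ == "__main__":
    weights = [0.9, -0.3, 0.05, 0.6]
    activations = [1.2, -0.7, 3.5, 0.1]
    exact = sum(w * a for w, a in zip(weights, activations))
    for w_bits, a_bits in ((4, 4), (4, 16), (8, 8)):
        approx = quantized_dot(weights, activations, w_bits, a_bits)
        print(f"W{w_bits}A{a_bits}: error = {abs(approx - exact):.6f}")
```

This only models rounding error. It says nothing about speed, which depends on whether the hardware has kernels for the chosen pair of widths.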