This note describes the GPTQ quantizer format, not how parameters are updated for quantization (see the GPTQ paper and this paper for some subsequent updates).

Preliminaries

Asymmetric quantization

If we quantize weights to $b$ bits, the weights are stored as the integer values $0 \ldots 2^b - 1$. Suppose that the smallest weight within a set of weights is $w_{min}$ and our largest weight is $w_{max}$. We want to map the range $[w_{min}, w_{max}]$ to the range of integer values $[0, 2^b - 1]$.

To do this mapping, we first define a scaling factor $s$, where $s = \frac{w_{max} - w_{min}}{2^b - 1}$. Suppose that we are quantizing weights using 4-bit integers; then $b = 4$ and $s = \frac{w_{max} - w_{min}}{15}$. We can then divide the weights by $s$ and round the result to get integer weights, i.e. $\mathrm{round}(w / s)$. Now $\mathrm{round}(w / s) \in [\mathrm{round}(w_{min} / s), \mathrm{round}(w_{max} / s)]$, so after rounding the weights can be represented using 16 integer values, but we still need to add a bias (called $z$ from here on) to get to the desired range of $[0, 2^b - 1]$. To do so, we can add $z = -\mathrm{round}(w_{min} / s)$.

Once we have determined $s$ and $z$, we can find the quantized weight using $q = \mathrm{round}(w / s) + z$ and the reconstructed weight using $\hat{w} = s(q - z)$.

Typically, $z$ is also stored in $b$ bits, but this only works when $w_{min} \leq 0$ (otherwise we get a negative $z$) and $w_{max} \geq 0$ (otherwise $z > 2^b - 1$). However, this should be true for most sets of weights.
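
To make this concrete, here is a minimal sketch of asymmetric quantization in PyTorch. It is illustrative only: the function names asymmetric_quantize and dequantize are made up, and real quantizers handle edge cases more carefully.

import torch

def asymmetric_quantize(w: torch.Tensor, bits: int = 4):
    # s maps [w_min, w_max] onto [0, 2^b - 1]; z shifts the rounded values into that range.
    w_min, w_max = w.min(), w.max()
    max_q = 2**bits - 1
    s = (w_max - w_min) / max_q
    z = torch.round(-w_min / s)
    q = torch.clamp(torch.round(w / s) + z, 0, max_q).to(torch.int32)
    return q, s, z

def dequantize(q: torch.Tensor, s: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Reconstruct the approximate weights from the quantized values.
    return s * (q.to(s.dtype) - z)

w = torch.randn(16)
q, s, z = asymmetric_quantize(w)
w_hat = dequantize(q, s, z)  # equal to w up to rounding error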

Real-world quantization We could make a very simple quantizer by finding $s$ and $z$ for a set of model parameters as-is, but it would usually result in an inaccurate model. The rounding in the quantizer may be too crude to represent the weight distribution without too much loss. Real-world quantization methods update a model's weights to work in tandem with the quantizers.

Symmetric quantization

The idea behind symmetric quantization is that the quantization range is symmetric around zero, i.e. $-w_{min} = w_{max}$ (conversely, this condition is not necessary in asymmetric quantization). In order to do so, their values are redefined as $w_{max} = \max(|w_{min}|, |w_{max}|)$ and $w_{min} = -w_{max}$. The value of $z$ is always going to be the same in this case, $z = 2^{b-1}$.
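
Continuing the sketch from above (again with a made-up function name, not the reference implementation), only the computation of $s$ and $z$ changes:

import torch

def symmetric_scale_zero(w: torch.Tensor, bits: int = 4):
    # Make the range symmetric: w_max := max(|w_min|, |w_max|), w_min := -w_max.
    w_absmax = w.abs().max()
    s = 2 * w_absmax / (2**bits - 1)
    z = torch.tensor(2 ** (bits - 1), dtype=w.dtype)  # constant zero point, 8 for 4-bit
    return s, z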

Symmetric quantization is the default in the most popular GPTQ implementations. I am not sure why, because fidelity is lost in cases where $w_{min}$ and $w_{max}$ are not close to $-\max(|w_{min}|, |w_{max}|)$ and $\max(|w_{min}|, |w_{max}|)$ respectively.

Packing integers

When the weights are quantized, they are not stored as-is; they are packed into int32 values. This serves a dual purpose: operations on 32-bit integers are fast on most hardware, and packing avoids wasting bits. When we are packing $b$-bit weights such that $32 \bmod b = 0$, packing is straightforward using bit shifts. For instance, we can pack four 8-bit weights using bitwise operators: packed = w0 | (w1 << 8) | (w2 << 16) | (w3 << 24). We can unpack e.g. w2 using w2 = (packed >> 16) & 0xff.

Masking The & 0xff in the unpacking example above is the mask that clears all bits except the 8 least significant bits. The mask can be computed as 2**bits - 1.
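
The same pattern generalizes to any bit width that divides 32. Here is a plain-Python sketch; the helper names pack and unpack are hypothetical:

def pack(values, bits):
    # Pack 32 // bits small integers into a single 32-bit word.
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits)
    return packed

def unpack(packed, bits):
    # Extract each value by shifting it down and masking off the other bits.
    mask = 2**bits - 1  # e.g. 0xff for 8 bits, 0xf for 4 bits
    return [(packed >> (i * bits)) & mask for i in range(32 // bits)]

weights = [1, 2, 3, 15, 0, 7, 8, 9]  # eight 4-bit values
assert unpack(pack(weights, bits=4), bits=4) == weights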

AutoGPTQ supports 2, 3, 4, and 8-bit quantization. 3-bit quantization is the odd one out, because $32 \bmod 3 \neq 0$. We could pack ten 3-bit weights (30 bits) into a 32-bit integer, but we would have two wasted bits. This is addressed by packing 32 3-bit integers into 3 int32s. Packing/unpacking is fiddly because some 3-bit values straddle int32 boundaries.

Luckily, 4-bit GPTQ quantization seems to have become the standard, and some kernels (e.g. ExLlama) only support 4-bit GPTQ.

Let's ignore 3-bit GPTQ For the remainder of this note I'll assume that $32 \bmod b = 0$, so no 3-bit quantization. It makes the shape definitions a little cleaner.

Bug in common implementations Many implementations have an issue in packing that results in invalid values when used with asymmetric quantization. During packing, they subtract 1 from the zero points $z$ and then convert them to uint32 for packing. During unpacking, they add 1 again. However, this results in an incorrect value when $z = 0$ before packing. For example:

>>> import torch
>>> zeros_uint32 = (torch.arange(16, dtype=torch.float16) - 1).to(torch.uint32)
>>> zeros_uint32
tensor([ 0,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14], dtype=torch.uint32)
>>> zeros_uint32.to(torch.float16) + 1
tensor([ 1.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15.], dtype=torch.float16)

The bug is quite unfortunate, because it makes asymmetric quantization perform worse than it should. However, it cannot be resolved without breaking compatibility with existing checkpoints. There is work on a new version of the GPTQ format to solve this issue.

How GPTQ quantizers are stored

What are we quantizing?

Before looking at the storage format, it’s a good idea to take a step back and look at what we are quantizing. In transformers these are linear layers. In Torch we construct a linear layer using:

linear = nn.Linear(in_features, out_features)

Linear will store the weights as a matrix $W$ with shape [out_features, in_features], which is applied to an input vector $x$ as $Wx$. We could quantize the matrix with a single $s$ and $z$. However, this would typically lead to a large quantization loss, since there can be a large variance in the weights. GPTQ instead uses $s$ and $z$ parameters for every output feature (row). This makes the most sense, since the weights in a row participate in the same dot product in $Wx$.
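
For example, per-row quantization parameters can be computed as in the sketch below. This only applies the idea from the Preliminaries row-wise; the actual GPTQ algorithm additionally updates the remaining weights to compensate for rounding errors.

import torch

out_features, in_features, bits = 8, 32, 4
max_q = 2**bits - 1
W = torch.randn(out_features, in_features)

# One scale and zero point per output feature (row).
w_min = W.min(dim=1).values           # (out_features,)
w_max = W.max(dim=1).values           # (out_features,)
scales = (w_max - w_min) / max_q      # (out_features,)
zeros = torch.round(-w_min / scales)  # (out_features,)

# Quantize and reconstruct row-wise using broadcasting.
Q = torch.clamp(torch.round(W / scales[:, None]) + zeros[:, None], 0, max_q)
W_hat = scales[:, None] * (Q - zeros[:, None])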

Simplified format

We will first start with a simplified description of GPTQ before getting to the actual format, to see how it naturally follows from the Preliminaries, before some additional complexity is added. This simplified GPTQ stores the weight matrix with shape [out_features, in_features], using $b$-bit quantization and packing $c = 32 / b$ quantized weights into each int32:

Parameter suffix   Shape                            Type      Description
qweight            (out_features, in_features/c)    int32     Integer-quantized weights, packed
scales             (out_features,)                  float16   $s$ per output feature
qzeros             (out_features/c,)                int32     $z$ per output feature, packed

Since we are quantizing rows, we have an $s$ and a $z$ per row. The only quirk is that the quantized weights and zero points are packed by storing $c$ values in each int32.

Groups

Since having a per-row scale and zero point reduces the quantization loss, we could do the same for the columns. Of course, it would be rather pointless to have in_features scales/zeros for each output feature, because then the scales matrix would consume as much memory as the original weight matrix. As a compromise, GPTQ divides the in_features columns evenly into n_groups groups instead. In the GPTQ quantizer configuration, this is usually configured as group_size (group_size = in_features / n_groups), resulting in the following shapes:

Parameter suffix   Shape                            Type      Description
qweight            (out_features, in_features/c)    int32     Integer-quantized weights, packed
scales             (n_groups, out_features)         float16   $s$ per group + output feature
qzeros             (n_groups, out_features/c)       int32     $z$ per group + output feature, packed
g_idx              (in_features,)                   int32     The group identifier for each input feature

scales and qzeros are now matrices that store the parameters for the n_groups groups. The new g_idx tensor maps input features to groups. So to get the quantization parameters for input feature/column i, we look up group = g_idx[i] and then use scales[group] and qzeros[group].
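
For example, without activation reordering g_idx typically just maps column i to group i // group_size, and a lookup looks like this (the sizes are made up for illustration):

import torch

in_features, out_features, group_size = 64, 8, 16
n_groups = in_features // group_size

scales = torch.rand(n_groups, out_features)      # s per group + output feature
g_idx = torch.arange(in_features) // group_size  # [0, 0, ..., 1, 1, ..., 3, 3]

i = 37                 # some input feature/column
group = int(g_idx[i])  # the group that column i belongs to (here: 2)
s_col = scales[group]  # (out_features,) scales used to dequantize column i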

Quantizer parameter groups are optional. Disabling this functionality is equivalent to using a single group, with g_idx == 0 for every input feature.

Transposition

The second difference between the simplified GPTQ format and the actual format is that the weight matrix is transposed before storage, so in the checkpoints we see the following parameters:

Parameter suffix   Shape                            Type      Description
qweight            (in_features/c, out_features)    int32     Integer-quantized weights, packed
scales             (n_groups, out_features)         float16   $s$ per output feature in a group
qzeros             (n_groups, out_features/c)       int32     $z$ per output feature in a group, packed
g_idx              (in_features,)                   int32     The group identifier for each input feature

And this is the actual storage format that you will find in PyTorch checkpoints of GPTQ models on e.g. the Hugging Face hub.
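
To tie the format together, here is a rough, unoptimized dequantization sketch for this layout. It is not the code used by real kernels: it assumes $32 \bmod b = 0$, assumes the common least-significant-bits-first packing order, and ignores the zero-point off-by-one bug described earlier.

import torch

def dequantize_gptq(qweight, qzeros, scales, g_idx, bits):
    # qweight: (in_features/c, out_features) int32, qzeros: (n_groups, out_features/c) int32,
    # scales: (n_groups, out_features) float16, g_idx: (in_features,) int32.
    c = 32 // bits
    mask = 2**bits - 1
    shifts = torch.arange(c, dtype=torch.int32) * bits

    # Unpack qweight along dim 0: every int32 holds c consecutive input features.
    q = (qweight[:, None, :] >> shifts[None, :, None]) & mask  # (in/c, c, out)
    q = q.reshape(-1, qweight.shape[1])                        # (in_features, out)

    # Unpack qzeros along dim 1: every int32 holds c consecutive output features.
    z = (qzeros[:, :, None] >> shifts[None, None, :]) & mask   # (groups, out/c, c)
    z = z.reshape(qzeros.shape[0], -1)                         # (n_groups, out)

    # Look up the scale and zero point of every input feature through its group.
    s = scales[g_idx.long()]                                   # (in_features, out)
    z = z[g_idx.long()].to(s.dtype)                            # (in_features, out)

    w = s * (q.to(s.dtype) - z)                                # (in_features, out)
    return w.t()                                               # (out_features, in_features)

Real kernels typically fuse this unpacking into the matrix multiplication rather than materializing the full weight matrix.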

Other GPTQ configuration options

These are some other configuration options that change how the quantizer works, but they do not affect the serialized format:

  • desc_act: the GPTQ quantization method is sensitive to the order in which weights are processed. When this option is enabled, the weights are sorted by descending activation. This prioritizes reducing the quantization loss of parameters that have a larger impact on activations. Activation sorting makes quantizer parameter lookups less efficient: the quantizer is constructed from the permuted weight matrix, so the scales and qzeros lookups are random-access after quantization.
  • static_groups: pre-calculates the quantization parameters before permuting the weight matrix when desc_act is used. Avoids the random-access pattern that desc_act introduces.