This note describes the GPTQ quantizer format, not how parameters are updated for quantization (see the GPTQ paper and this paper for some subsequent updates).

# Preliminaries

## Asymmetric quantization

If we quantize weights to $b$ bits, the weights are stored as the integer values $[0, 2^{b}-1]$. Suppose the smallest weight within a set of weights $w$ is $w_{min}=\min(w)$ and the largest weight is $w_{max}=\max(w)$. We want to map the range $[w_{min}, w_{max}]$ to the range of integer values $[0, 2^{b}-1]$.

To do this mapping, we first define a scaling factor $scale=\frac{w_{max}-w_{min}}{q_{max}}$, where $q_{max}=2^{b}-1$. Suppose that we are quantizing weights using 4-bit integers, then $q_{max}=2^{4}-1=15$, and we have $w_{min}=-1.5$ and $w_{max}=3.0$. Then $scale=\frac{3.0-(-1.5)}{15}=0.3$. We can then divide the weights by $scale$ and round the result to get integer weights, i.e. $[\mathrm{round}(\frac{-1.5}{0.3}),\mathrm{round}(\frac{3.0}{0.3})]=[-5,10]$. Now $\frac{w_{max}}{scale}-\frac{w_{min}}{scale}=15$, so after rounding the weights can be represented using 16 integer values, but we still need to add a bias (called $zero$ from here on) to get to the desired range of $[0,2^{4}-1]$. To do so, we can add $zero=\mathrm{round}(-\frac{w_{min}}{scale})=\mathrm{round}(-\frac{-1.5}{0.3})=5$.

Once we have determined $scale$ and $zero$, we can find the quantized weight $q(w_{i})$ using $q(w_{i})=\mathrm{clamp}(\mathrm{round}(\frac{w_{i}}{scale})+zero, 0, q_{max})$ and the reconstructed weight using $q'(q(w_{i}))=(q(w_{i})-zero)\cdot scale$.
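As a concrete sketch (plain NumPy, not any particular library's code), the quantize/reconstruct round trip above looks like this:

```python
import numpy as np

def asymmetric_quantize(w, bits=4):
    """Map float weights to integers in [0, 2**bits - 1]."""
    q_max = 2**bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / q_max
    zero = int(np.round(-w_min / scale))
    q = np.clip(np.round(w / scale).astype(np.int64) + zero, 0, q_max)
    return q, scale, zero

def reconstruct(q, scale, zero):
    """Approximate the original weights from the quantized values."""
    return (q - zero) * scale

# The 4-bit example from the text: w_min = -1.5, w_max = 3.0.
w = np.array([-1.5, 0.0, 0.9, 3.0])
q, scale, zero = asymmetric_quantize(w)
# q = [0, 5, 8, 15], scale ≈ 0.3, zero = 5
```

Note that reconstruction is only approximate in general; here the chosen weights happen to be exact multiples of $scale$, so they round-trip exactly.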

Typically, $zero$ is also stored in $b$ bits, but this only works when $w_{min}\leq 0$ (otherwise we would get a negative $zero$) and $w_{max}\geq 0$ (otherwise $zero>q_{max}$). However, this should be true for most sets of weights.

**Real-world quantization:** We could make a very simple quantizer by finding $scale$ and $zero$ for a set of model parameters as-is, but this would usually result in an inaccurate model. The rounding in the quantizer may be too crude to represent the weight distribution without significant loss. Real-world quantization methods therefore update a model's weights to work in tandem with the quantizers.

## Symmetric quantization

The idea behind *symmetric quantization* is that $|w_{min}|=|w_{max}|$ (conversely, this condition is not necessary in *asymmetric quantization*). To achieve this, the values are redefined as $w_{max}=\max(|w|)$ and $w_{min}=-w_{max}$. The value of $zero$ is then always the same: $zero=\frac{2^{b}}{2}=2^{b-1}$.

Symmetric quantization is the default in the most popular GPTQ implementations. I am not sure why, because fidelity is lost in cases where the redefined $w_{min}$ and $w_{max}$ are not close to $\min(w)$ and $\max(w)$ respectively.
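A quick sketch of that fidelity point (a hypothetical example in plain NumPy): with a skewed weight range, symmetric quantization leaves part of the integer range unused:

```python
import numpy as np

def symmetric_quantize(w, bits=4):
    """Symmetric variant: w_max = max(|w|), w_min = -w_max."""
    q_max = 2**bits - 1
    w_max = np.abs(w).max()
    scale = 2 * w_max / q_max
    zero = 2**bits // 2  # always 2^(b-1)
    q = np.clip(np.round(w / scale).astype(np.int64) + zero, 0, q_max)
    return q, scale, zero

# min(w) = -0.5 is far from -max(|w|) = -3.0, so the integer
# values below round(-0.5 / scale) + 8 = 7 are never used.
w = np.array([-0.5, 0.0, 1.0, 3.0])
q, scale, zero = symmetric_quantize(w)
```

Here the quantized values only occupy $[7, 15]$; an asymmetric quantizer would spread the same weights over the full $[0, 15]$ range with a smaller $scale$.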

## Packing integers

When the values are quantized, they are not stored as-is. They are packed into `int32` values. This is dual-purpose: operations on 32-bit integers are fast on most hardware, and the packing avoids wasting bits. When we are packing weights such that $32 \bmod b = 0$, packing is straightforward using bit shifts. For instance, we can pack four 8-bit weights using bitwise operators: `packed = w0 | (w1 << 8) | (w2 << 16) | (w3 << 24)`. We can unpack e.g. `w2` using `w2 = (packed >> 16) & 0xff`.

**Masking:** The `& 0xff` in the unpacking example above is the mask that clears out all bits except the 8 least significant bits. The mask can be computed using `2**bits - 1`.
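To make the bit fiddling concrete, here is a small pack/unpack sketch (generic helper names, not from any library) for the $32 \bmod b = 0$ case:

```python
def pack_word(values, bits):
    """Pack 32 // bits integers into one 32-bit word, values[0] in the lowest bits."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * bits)
    return word & 0xFFFFFFFF

def unpack_word(word, bits):
    """Invert pack_word using shifts and the 2**bits - 1 mask."""
    mask = 2**bits - 1
    return [(word >> (i * bits)) & mask for i in range(32 // bits)]

packed = pack_word([0x01, 0x02, 0x03, 0xFF], bits=8)
# packed == 0xFF030201; unpack_word(packed, 8) recovers the four inputs.
```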

AutoGPTQ supports 2, 3, 4, and 8-bit quantization. 3-bit quantization is the odd one out, because $32 \bmod 3 = 2$. We could pack ten 3-bit integers (30 bits) into a 32-bit integer, but we would have two redundant bits. This is addressed by packing 32 3-bit integers into 3 `int32`s. Packing/unpacking is fiddly because some integers are split across two `int32`s.

Luckily, 4-bit GPTQ quantization seems to have become the standard, and some kernels (e.g. ExLlama) only support 4-bit GPTQ.

**Let's ignore 3-bit GPTQ:** For the remainder of this note I'll assume that $32 \bmod b = 0$, so no 3-bit quantization. It makes the shape definitions a little cleaner.

**Bug in common implementations:** Many implementations have an issue in packing that will result in invalid $zero$ values when used with asymmetric quantization. During packing, they subtract $1$ from the $zero$ values and then convert $zero$ to `uint32` for packing; during unpacking, they add $1$ back. However, this results in an incorrect value when $zero=0$ before packing: $0-1$ underflows the unsigned integer, so the $b$ bits that get packed are all ones, and unpacking yields $(2^{b}-1)+1=2^{b}$ rather than $0$. The bug is quite unfortunate, because it makes asymmetric quantization perform much worse than it should. However, it cannot be resolved without breaking compatibility with existing checkpoints. There is work on a new version of the GPTQ format to solve this issue.
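The off-by-one round trip can be demonstrated in a few lines (a sketch mimicking the buggy behavior, not code from any specific implementation):

```python
def buggy_store_zero(zero, bits=4):
    """Packing side: subtract 1, treat as uint32, keep the low `bits` bits."""
    mask = 2**bits - 1
    as_uint32 = (zero - 1) & 0xFFFFFFFF  # underflows to 0xFFFFFFFF when zero == 0
    return as_uint32 & mask

def buggy_load_zero(stored, bits=4):
    """Unpacking side: add the 1 back."""
    return stored + 1

# Every zero in [1, 15] survives the round trip...
assert all(buggy_load_zero(buggy_store_zero(z)) == z for z in range(1, 16))
# ...but zero == 0 comes back as 16, which is not even a valid 4-bit value.
assert buggy_load_zero(buggy_store_zero(0)) == 16
```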

# How GPTQ quantizers are stored

## What are we quantizing?

Before looking at the storage format, it's a good idea to take a step back and look at what we are quantizing. In transformers these are linear layers. In Torch we construct a linear layer using:

```
linear = nn.Linear(in_features, out_features)
```

`Linear` will store the weights as a matrix with shape `[out_features, in_features]` and is applied to an input vector as $Wx$. We could quantize the matrix with a single $scale$ and $zero$. However, this would typically lead to a large quantization loss, since there can be a large variance in the weights. GPTQ instead uses $scale$ and $zero$ parameters for every output feature (row). This makes the most sense, since the weights in a row participate in the same dot product in $Wx$.

## Simplified format

We will first start with a simplified description of GPTQ before getting to the actual format, to see how it naturally follows from the preliminaries, before some additional complexity is added. This simplified GPTQ stores the weight matrix with shape `[out_features, in_features]`, with $b$-bit quantization and $c=\lfloor\frac{32}{b}\rfloor$ quantized weights stored in an `int32`:

Parameter suffix | Shape | Type | Description
---|---|---|---
`qweight` | `(out_features, in_features/c)` | `int32` | Integer-quantized weights, packed
`scales` | `(out_features,)` | `float16` | $scale$ per output feature
`qzeros` | `(out_features/c,)` | `int32` | $zero$ per output feature, packed

Since we are quantizing rows, we have a $scale$ and $zero$ per row. The only quirk is that the quantized weights and zeros are packed by storing `c` values in an `int32`.
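For example, packing an (already quantized) tiny weight matrix into `qweight` for this simplified layout might look like the following sketch, with toy sizes and hypothetical variable names:

```python
import numpy as np

bits = 4
c = 32 // bits  # 8 quantized 4-bit weights per int32
out_features, in_features = 2, 8

# Hypothetical 4-bit integer weights in [0, 15]:
q = np.arange(out_features * in_features).reshape(out_features, in_features) % 16

qweight = np.zeros((out_features, in_features // c), dtype=np.uint32)
for o in range(out_features):
    for j in range(in_features):
        # Column j lands in word j // c, at bit offset bits * (j % c).
        qweight[o, j // c] |= np.uint32(int(q[o, j]) << (bits * (j % c)))

# Row 0 holds the nibbles 0..7 and row 1 the nibbles 8..15.
```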

## Groups

Since having a per-row scale and zero reduces the quantization loss, we could do the same for the columns. Of course, it would be rather pointless to have `in_features` scales/zeros for each output feature, because then the `scales` matrix would consume as much memory as the original weight matrix. As a compromise, GPTQ instead divides `in_features` evenly into `n_groups` groups. In the GPTQ quantizer configuration, this is usually configured as `group_size` (`group_size = in_features / n_groups`), resulting in the following shapes:

Parameter suffix | Shape | Type | Description
---|---|---|---
`qweight` | `(out_features, in_features/c)` | `int32` | Integer-quantized weights, packed
`scales` | `(n_groups, out_features)` | `float16` | $scale$ per group + output feature
`qzeros` | `(n_groups, out_features/c)` | `int32` | $zero$ per group + output feature, packed
`g_idx` | `(in_features,)` | `int32` | The group identifier for each input feature

`scales` and `qzeros` are now matrices that store the parameters for the `n_groups` groups. The new `g_idx` tensor maps input features to groups. So to get the scale for input feature/column `i`, we look up `group = g_idx[i]` and can then get the scales/zeros using `scales[group]`/`qzeros[group]`.

Quantizer parameter groups are optional. Disabling this functionality is equivalent to using one group and `g_idx == 0`.
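In code, the per-column lookup might be sketched like this (hypothetical helper names; for clarity this uses the untransposed layout and assumes the zeros are already unpacked from `qzeros`):

```python
import numpy as np

def column_params(i, g_idx, scales, zeros):
    """Return the (scale, zero) vectors for input feature/column `i`.

    g_idx:  (in_features,) group id per input feature
    scales: (n_groups, out_features)
    zeros:  (n_groups, out_features), already unpacked from qzeros
    """
    group = g_idx[i]
    return scales[group], zeros[group]

# Two groups of two input features each, one output feature:
g_idx = np.array([0, 0, 1, 1])
scales = np.array([[0.5], [0.25]])
zeros = np.array([[8], [8]])

scale, zero = column_params(2, g_idx, scales, zeros)
# Column 2 belongs to group 1, so its scale is 0.25.
```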

## Transposition

The second difference between the simplified GPTQ format and the actual format is that the weight matrix is transposed before storage, so in the checkpoints we see the following parameters:

Parameter suffix | Shape | Type | Description
---|---|---|---
`qweight` | `(in_features/c, out_features)` | `int32` | Integer-quantized weights, packed
`scales` | `(n_groups, out_features)` | `float16` | $scale$ per output feature in a group
`qzeros` | `(n_groups, out_features/c)` | `int32` | $zero$ per output feature in a group, packed
`g_idx` | `(in_features,)` | `int32` | The group identifier for each input feature

And this is the actual storage format that you will find in PyTorch checkpoints of GPTQ models on e.g. the Hugging Face hub.
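Putting the pieces together, a rough (unoptimized) NumPy sketch of dequantizing this stored format could look as follows. This is a sketch under stated assumptions, not a reference implementation: it assumes values are packed lowest-bits-first, $32 \bmod b = 0$, and it ignores the zero-point off-by-one bug discussed earlier; the names are illustrative:

```python
import numpy as np

def unpack(packed, bits, axis):
    """Unpack c = 32 // bits values per int32 along `axis`, lowest bits first."""
    c = 32 // bits
    mask = np.uint32(2**bits - 1)
    shifts = (np.arange(c) * bits).astype(np.uint32)
    p = np.moveaxis(packed.astype(np.uint32), axis, -1)  # (..., n)
    vals = (p[..., :, None] >> shifts) & mask            # (..., n, c)
    vals = vals.reshape(*p.shape[:-1], -1)               # (..., n * c)
    return np.moveaxis(vals, -1, axis)

def dequantize(qweight, scales, qzeros, g_idx, bits=4):
    """Reconstruct W with shape (out_features, in_features)."""
    w_int = unpack(qweight, bits, axis=0)   # (in_features, out_features)
    zeros = unpack(qzeros, bits, axis=1)    # (n_groups, out_features)
    scale = scales[g_idx]                   # (in_features, out_features)
    zero = zeros[g_idx]                     # (in_features, out_features)
    return ((w_int.astype(np.float32) - zero) * scale).T
```

A real kernel would fuse the unpacking into the matrix multiplication rather than materializing the full `float` matrix, but the shapes and lookups are the same.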

# Other GPTQ configuration options

These are some other configuration options that change how the quantizer works but have no ramifications for the serialized format:

`desc_act`: the GPTQ quantization method is sensitive to the order in which weights are processed. When this option is enabled, the weights are sorted by descending activation. This prioritizes reducing the quantization loss of parameters that have a larger impact on activations. Activation sorting makes quantizer parameter lookups less efficient: the quantizer is constructed from the permuted weight matrix, so the `scales` and `qzeros` lookups are random-access after quantization.

`static_groups`: pre-calculates the quantization parameters before permuting the weight matrix when `desc_act` is used. This avoids the random-access pattern that `desc_act` introduces.