Quantizing a Neural Network

Recently I have been working on quantizing typical neural networks to uint8 to figure out the influence on their performance. Today I got some results, which show that for the Inception-BN and VGG models on the CIFAR10 and ImageNet datasets, the effect on performance is negligible, from which I believe we can conclude that uint8 quantization will not decrease performance. The concrete results are attached at the end of the post.

This post writes down some important conclusions and thoughts.

What is Quantization

From my understanding, so-called quantization is nothing but the process of changing floating-point storage and operations into fixed-point ones. The main purposes are:

  • save space
  • speed up by utilizing integer operations

The first goal is obvious and immediate, e.g. 4x storage is saved when converting from float32 to uint8. Note that this is only storage space; runtime memory is another story. The second goal, on the other hand, is not that easy to achieve: it requires us to investigate every single operator and find the corresponding steps and optimizations.

There are mainly two ways of quantization:

  • Linear quantization
    • use a linear mapping to (de)quantize
  • Codebook quantization
    • (de)quantize by looking up a mapping table

In fact, those two are essentially equivalent, which I believe is also the essence of quantization:

to find an invertible mapping from floating-point to fixed-point numbers, and to transform the operations correspondingly under such a mapping

The difference between the ways of quantization shows up in how the operations between elements have to be re-defined! For example:

  • Linear quantization
    • element mapping: $x\mapsto a\hat{x}+b$
    • element addition: $\widehat{x+y} = \hat{x}+\hat{y}+ \frac{b}{a}$, which is NOT simply $\hat{x}+\hat{y}$ (see the worked example below)
  • Codebook quantization
    • element mapping: look up a table
    • element addition: look up twice
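
As a tiny worked example with made-up numbers: suppose the linear mapping is $x\approx a\hat{x}+b$ with $a=0.5$ and $b=-10$; then

$$\begin{aligned} x=3 &\;\Rightarrow\; \hat{x}=26, \qquad y=7 \;\Rightarrow\; \hat{y}=34,\\ x+y=10 &\;\Rightarrow\; \widehat{x+y}=\frac{(x+y)-b}{a}=40=\hat{x}+\hat{y}+\frac{b}{a}. \end{aligned}$$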

As a result, a different mapping implies a different optimization of the computation. Deep Compression by Song Han chooses the codebook approach and uses the K-means algorithm to generate the codebook, along with dedicated hardware designed to speed up the lookup. In contrast, I chose the linear method, since it lets us exploit some mathematics.

In fact, we can define the process of quantization using the language from Category Theory:

to find a functor from floating-point category to fixed-point category

which implies not only the mapping between elements, but also the one between operators!

Formalization of Linear Quantization

We call $Q_V(\mathbb{R})$ the linear quantized space from $\mathbb{R}$ to a finite integer ring $V$.

The elements of $Q_V(\mathbb{R})$ are denoted by $(\hat{x}, l, u)$ and are referred to as quantized numbers. We call $\hat{x}\in V$ the quantized value, and $(l, u)$ with $l\leqslant u\in \mathbb{R}$ the quantizing range. Sometimes we omit the range when it is clear from the context, but keep in mind that every element of $Q_V(\mathbb{R})$ must carry two real numbers $l$ and $u$ representing the endpoints of the interval over which the value is quantized.

The quantize mapping from $\mathbb{R}$ to $Q_V(\mathbb{R})$ is:

$$q_{l,u}:x\mapsto \left(\left[\frac{x-l}{\Delta}\right], l, u\right)$$

where the bracket denotes rounding to the nearest integer, saturated to the range of $V$. The dequantize mapping is:

$$q^{-1}:(\hat{x}, l, u) \mapsto \hat{x}\Delta + l$$

where

$$\Delta = \frac{u-l}{|V| - 1}$$

is the minimum representable difference.

In this post we take $V$ to be the ring $U$ of uint8 integers, in which addition and multiplication are saturated, i.e. capped at 0 and 255. We have $|U|=256$, and we write $S = |U| - 1 = 255$ from now on.
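
A minimal C++ sketch of the two mappings for $U$ (the function names and the float-based implementation are mine, not from any library):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr float S = 255.0f;  // |U| - 1

// q_{l,u}: quantize a real number into (x_hat, l, u), saturated to [0, S]
uint8_t quantize(float x, float l, float u) {
    float delta = (u - l) / S;              // minimum representable difference
    float v = std::round((x - l) / delta);  // [.] denotes rounding, not floor
    return static_cast<uint8_t>(std::min(std::max(v, 0.0f), S));
}

// q^{-1}: dequantize (x_hat, l, u) back to a real number
float dequantize(uint8_t x_hat, float l, float u) {
    float delta = (u - l) / S;
    return x_hat * delta + l;
}
```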

Linear Invariance

It is easy to see a simple property of the linear quantized space, which I call linear invariance: if

$$q_{l,u}(x)=(\hat{x},l,u)$$

then

$$q_{l,u}(ax+b)=(\hat{x}, al+b, au+b)$$

As a result, when linearly transforming a bunch of quantized numbers under the same range, we need to do nothing but update the two endpoints of the range under that linear transformation. In particular, we do NOT need to update the quantized values of those numbers!
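
As a quick sanity check with made-up numbers ($S=255$): $q_{0,1}(0.5)=(128,0,1)$, and applying $x\mapsto 2x+1$ indeed leaves the quantized value alone:

$$q_{1,3}(2\cdot 0.5+1)=q_{1,3}(2)=\left(\left[\frac{2-1}{2/255}\right],1,3\right)=([127.5],1,3)=(128,1,3)$$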

Range Adjustment

We call two quantized numbers equal if they dequantize to the same number in $\mathbb{R}$. However, this seldom happens, since the precision of quantized numbers is limited. So we define two quantized numbers $(\hat{x}, l, u)$ and $(\hat{y}, l', u')$ to be almost equal if

$$\left|(\Delta\hat{x}+l) - (\Delta'\hat{y} + l')\right| < \frac{1}{2}\min{(\Delta,\Delta')}$$

which looks terrifying, but in fact says nothing more than that $\hat{x}$ and $\hat{y}$ are the best choices to represent the same real number under their respective quantizing ranges.

There is a common operation within the space $Q_V(\mathbb{R})$ itself, which I call range adjustment: if $(\hat{x}, l, u)$ and $(\hat{y}, l', u')$ are almost equal, then

$$\hat{y} = \left[\frac{\hat{x}\Delta+l-l'}{\Delta'}\right]$$

This operation is so common that we should find a fast way to compute it. So far I have only implemented it naively.
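
For reference, the naive version might look like this in C++ (my own sketch; the two ranges and their $\Delta$s are passed in explicitly):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Range adjustment: re-express (x_hat, l, u) under a new range (l2, u2),
// i.e. y_hat = round((x_hat * delta + l - l2) / delta2), saturated to [0, 255].
uint8_t adjust_range(uint8_t x_hat, float l, float u, float l2, float u2) {
    const float S = 255.0f;
    float delta  = (u  - l ) / S;
    float delta2 = (u2 - l2) / S;
    float y = std::round((x_hat * delta + l - l2) / delta2);
    return static_cast<uint8_t>(std::min(std::max(y, 0.0f), S));
}
```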

Operations in Linear Quantized Space

In this section I provide the quantized version of several common operators in neural network.

Fully Connected

Please refer to UInt8 Matrix Multiplication.

Convolution

On GPU, cuDNN provides an API for floating-point convolution, but no uint8 version. Luckily, CUDA 8.0 at least provides a uint8 GEMM API, so I can write the convolution by myself, for which I first had to write im2col functions. The details are in Implement Convolution in CNN.
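
For illustration, here is a single-channel, stride-1, no-padding im2col sketch in plain C++ (a much-simplified CPU version of what actually runs on GPU; the memory layout is my own choice):

```cpp
#include <cstdint>
#include <vector>

// im2col: unfold an (H, W) input into a (kh*kw) x (out_h*out_w) matrix so that
// convolution becomes a single uint8 GEMM. Stride 1, no padding, one channel.
std::vector<uint8_t> im2col(const std::vector<uint8_t>& img,
                            int H, int W, int kh, int kw) {
    int out_h = H - kh + 1, out_w = W - kw + 1;
    std::vector<uint8_t> cols(static_cast<size_t>(kh) * kw * out_h * out_w);
    for (int i = 0; i < kh; ++i)
        for (int j = 0; j < kw; ++j)
            for (int y = 0; y < out_h; ++y)
                for (int x = 0; x < out_w; ++x)
                    cols[((i * kw + j) * out_h + y) * out_w + x] =
                        img[(y + i) * W + (x + j)];
    return cols;
}
```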

ReLU Activation

ReLU is computed in $\mathbb{R}$ as:

$$x\mapsto\max{(x,0)}$$

In $Q_V(\mathbb{R})$ , if the input range is $(l, u)$ where $l<0, u>0$, the output range should be $(0, u)$, and we have the mapping
$$\begin{aligned} &(\hat{x}, l, u) \mapsto (\hat{y}, 0, u)\\ \hat{y}&= \left[\frac{1}{\Delta_y}(\max{(\hat{x}\Delta_x+l, 0)} - 0)\right]\\ &= \left[\frac{\max{\left(\hat{x}\frac{u-l}{S}+l, 0\right)}\cdot S}{u}\right]\\ &=\left[\max{\left(\hat{x}\left(1-\frac{l}{u}\right)+\frac{l}{u}S, 0\right)}\right]\\ &=\hat{x} - \left[(S - \hat{x})\frac{-l}{u}\right] \end{aligned}$$

where

  • the last subtraction is saturated
  • note that on uint8, $S - \hat{x}$ still fits in uint8

All other cases of the input range are obvious and need no computation.
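
In code, this boils down to one saturated subtraction per element. A sketch, assuming $l<0<u$ as above:

```cpp
#include <cmath>
#include <cstdint>

// Quantized ReLU for an input range (l, u) with l < 0 < u; the output range is (0, u).
// y_hat = x_hat - [(S - x_hat) * (-l) / u], where the subtraction saturates at 0.
uint8_t relu_q(uint8_t x_hat, float l, float u) {
    const float S = 255.0f;
    float t = std::round((S - x_hat) * (-l) / u);  // S - x_hat still fits in uint8
    return (t >= x_hat) ? 0 : static_cast<uint8_t>(x_hat - t);
}
```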

Batch Normalization

The input of a batch normalization layer is a tensor of dimension $(N, C, H, W)$. The layer transforms each of the $C$ slices of shape $(N, H, W)$ along the second dimension by:

$$x \mapsto \gamma_c\frac{x-\mu_c}{\sigma_c} + \beta_c$$

where $\mu_c, \sigma_c$ are the mean and standard deviation of the $c$th slice, and $\gamma_c$ and $\beta_c$ are pre-determined parameters for each $c$.

By linear invariance, the quantized batch norm can be calculated as:

  • For each $c$, let $l_c =f_c(l), u_c = f_c(u)$, where $f_c(x) = \gamma_c\frac{x-\mu_c}{\sigma_c} + \beta_c$ and $(l, u)$ is the input range
  • Find $l'=\min_c{(l_c)}, u' = \max_c{(u_c)}$
  • For each $c$, adjust the range of $(\hat{x}, l_c, u_c)$ to the uniform range $(l', u')$

Furthermore, instead of calculating $\mu_c, \sigma_c$ on the fly, they are passed in as parameters, since they are calculated and accumulated as moving averages during training. As a result, the speed bottleneck is the third step, i.e. adjusting $C$ ranges to a uniform one.
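
A rough CPU sketch of the three steps for a single example ($N=1$), reusing the adjust_range routine sketched earlier; the real kernel fuses these loops, and the negative-$\gamma_c$ swap is my own detail:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// From the Range Adjustment sketch above.
uint8_t adjust_range(uint8_t x_hat, float l, float u, float l2, float u2);

// Quantized batch norm via linear invariance: the uint8 values stay untouched
// until the final per-channel range adjustment to the common range (l_out, u_out).
void batchnorm_q(std::vector<uint8_t>& data, int C, int chan_size,
                 float l, float u,
                 const std::vector<float>& gamma, const std::vector<float>& beta,
                 const std::vector<float>& mu, const std::vector<float>& sigma,
                 float& l_out, float& u_out) {
    std::vector<float> lc(C), uc(C);
    for (int c = 0; c < C; ++c) {   // step 1: only the endpoints are transformed
        lc[c] = gamma[c] * (l - mu[c]) / sigma[c] + beta[c];
        uc[c] = gamma[c] * (u - mu[c]) / sigma[c] + beta[c];
        if (lc[c] > uc[c]) std::swap(lc[c], uc[c]);  // a negative gamma flips the range
    }
    l_out = *std::min_element(lc.begin(), lc.end());  // step 2: the common range
    u_out = *std::max_element(uc.begin(), uc.end());
    for (int c = 0; c < C; ++c)     // step 3: the bottleneck, C range adjustments
        for (int i = 0; i < chan_size; ++i)
            data[c * chan_size + i] =
                adjust_range(data[c * chan_size + i], lc[c], uc[c], l_out, u_out);
}
```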

Max and Avg Pooling

Obviously, quantization preserves max and average. In other words, we can perform the corresponding operation directly on uint8 integers without changing the range. Note that we need an intermediate variable of a wider type, such as int, when computing the average.

However, we usually still need to adjust the range at the end for better accuracy, which costs extra time.
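
For instance, a minimal 2x2 average-pooling kernel on uint8 (a sketch; the quantizing range is simply carried over):

```cpp
#include <cstdint>

// Average of a 2x2 uint8 window: accumulate in int to avoid overflow,
// then round back to uint8. The quantizing range is left unchanged.
uint8_t avg_pool_2x2(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    int sum = int(a) + int(b) + int(c) + int(d);
    return static_cast<uint8_t>((sum + 2) / 4);  // +2 rounds to nearest
}
```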

Concat

We only need to adjust the ranges of the parts to a uniform one.

Determine Operator Output Range

There is another important issue to solve: how do we determine the output range of each operator, given the quantized inputs, i.e. a bunch of uint8 values and their quantizing ranges?

The output range of some operators, such as ReLU, is obvious, while for others it is not, e.g. fully connected, which is a matrix multiplication. Initially I tried several heuristics to estimate the output range. For example, I found that for an $(m,n)$ matrix with range $(l_1,u_1)$ multiplied by an $(n, k)$ matrix with range $(l_2,u_2)$, the output range can be roughly estimated by

$$(l,u)\times\sqrt{n}$$

where

$$\begin{aligned} l&=\min{(l_1l_2,l_1u_2,u_1l_2,u_1u_2)}\\ u&=\max{(l_1l_2,l_1u_2,u_1l_2,u_1u_2)}\\ \end{aligned}$$

At least on randomly generated data, such an estimate makes little error. However, it failed on real data. Besides, there are many other operators whose output ranges cannot even be estimated by closed-form formulas.

Finally, my solution is to use the training data and the original model to directly record the output range of each layer, and to introduce a new hyperparameter that shrinks the recorded ranges into the quantizing ranges used in the quantized model. It works surprisingly well :)
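
Conceptually, the calibration looks like the sketch below; the symmetric way of shrinking and the name `shrink` for the hyperparameter are my own choices here:

```cpp
#include <algorithm>
#include <vector>

// Record the output range of one layer over calibration batches on the original
// float model, then shrink it by the hyperparameter `shrink` in (0, 1] to obtain
// the quantizing range used in the quantized model.
void calibrate_range(const std::vector<std::vector<float>>& outputs_per_batch,
                     float shrink, float& l, float& u) {
    l = 0.0f;  // starting from 0 keeps zero inside the recorded range
    u = 0.0f;
    for (const auto& out : outputs_per_batch) {
        l = std::min(l, *std::min_element(out.begin(), out.end()));
        u = std::max(u, *std::max_element(out.begin(), out.end()));
    }
    float mid = 0.5f * (l + u);
    float half = 0.5f * (u - l) * shrink;  // shrink symmetrically about the midpoint
    l = mid - half;
    u = mid + half;
}
```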

Notes on Implementation

It turns out that the performance after quantization is extremely sensitive to rounding error. The following are several lessons I have learned.

  • Always use std::round instead of std::floor

    Initially, I carelessly used std::floor when mapping a float32 to uint8 in one place. The model accuracy dropped to, surprisingly, 0!

  • Always choose the quantizing range such that 0 is quantized as accurately as possible, i.e. $q^{-1}(q_{l,u}(0))$ should ideally be 0 again.

    This principle is intended for the convolution operators, which are always zero-padded. If the quantized zero in fact departs a lot from the real zero, it causes much error.
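
One simple way to enforce this (my own nudging scheme, not necessarily the only one) is to shift the range slightly so that 0 lands exactly on the quantization grid:

```cpp
#include <cmath>

// Nudge the range so that 0 is exactly representable: after the shift,
// (0 - l) / delta is an integer, hence q^{-1}(q_{l,u}(0)) == 0.
void align_zero(float& l, float& u) {
    const float S = 255.0f;
    float delta = (u - l) / S;
    float zero_point = std::round((0.0f - l) / delta);  // nearest grid index for 0
    float new_l = -zero_point * delta;                   // grid now passes through 0
    u += new_l - l;                                      // shift both ends, keep delta
    l = new_l;
}
```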

The following table demonstrates the influence on performance of the above two issues. The accuracy is computed on a single batch, i.e. 32 examples.

| Model | Dataset | Normal | Misused std::floor | No range adjustment |
| --- | --- | --- | --- | --- |
| Inception-BN | ImageNet | 59.375% | 0% | 18.75% |
| VGG | ImageNet | 62.5% | 0% | 9.375% |

It can be seen that those two issues have a dramatic influence on the final performance.

Results

The following table shows the results of my experiments on some typical models. The performance is Top-1 accuracy on the validation dataset:

| Model | Dataset | Original | Quantized |
| --- | --- | --- | --- |
| Small Inception-BN | CIFAR10 | 90.97% | 90.87% |
| Inception-BN | ImageNet | 67.61% | 66.75% |
| VGG | ImageNet | 70.69% | 69.67% |

We see that the experiments cover networks and datasets at both small and large scale, which I believe indicates that uint8 quantization does little, if any, harm to performance.