In this post I will show how to perform uint8 matrix multiplication using a hidden API in CUDA 8.0.

# Mathematics of Quantized Matrix Multiplication

Given two matrices $A, B$ over $\mathbb{R}$, we need to calculate their product in the quantized space $Q_U(\mathbb{R})$, where $U$ is the set of uint8 integers.

Suppose $A$ is mapped to $(\hat{A}, m_A, M_A)$ with quantization step $\Delta_A\triangleq(M_A-m_A)/255$, and $B$ is mapped to $(\hat{B}, m_B, M_B)$ with $\Delta_B$ defined likewise. Then in $\mathbb{R}$ we have:

\label{corg} \begin{align} \begin{split} C=AB&\approx (\hat{A}\Delta_A+m_A)(\hat{B}\Delta_B+m_B)\\ &=\Delta_A\Delta_B\left(\hat{A}+\frac{m_A}{\Delta_A}\right)\left(\hat{B}+\frac{m_B}{\Delta_B}\right)\\ &\approx\Delta_A\Delta_B\left(\hat{A}+\left[\frac{m_A}{\Delta_A}\right]\right)\left(\hat{B}+\left[\frac{m_B}{\Delta_B}\right]\right) \end{split} \end{align}

where the scalar additions are performed elementwise. Note that the product of the two bracketed factors is an integer matrix product (although not a uint8 one)!
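As a concrete illustration, the linear quantization map and its inverse can be sketched as follows (a C++ sketch of the standard scheme; the helper names are mine):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Linear uint8 quantization: x ≈ x_hat * delta + m, where delta = (M - m) / 255
// and (m, M) is the value range of the matrix.
struct Quantized {
  std::vector<uint8_t> values;  // the uint8 matrix \hat{A}
  double m, M;                  // the range (m_A, M_A)
  double delta() const { return (M - m) / 255.0; }
};

// Map each real entry to its nearest uint8 level: x_hat = [(x - m) / delta].
Quantized quantize(const std::vector<double>& x, double m, double M) {
  Quantized q{{}, m, M};
  for (double v : x)
    q.values.push_back(uint8_t(std::lround((v - m) / q.delta())));
  return q;
}

// Recover an approximation of the original real value.
double dequantize(const Quantized& q, size_t i) {
  return q.values[i] * q.delta() + q.m;
}
```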

Then we quantize $C$ back into the quantized space:

$$\hat{C} = \left[\frac{C-m_C}{\Delta_C}\right]$$

Substituting formula $(\ref{corg})$ in, we get

\label{chat} \begin{align} \hat{C}&\triangleq\frac{\Delta_A\Delta_B}{\Delta_C}\left(\left(\hat{A}+\left[\frac{m_A}{\Delta_A}\right]\right)\left(\hat{B}+\left[\frac{m_B}{\Delta_B}\right]\right)-\left[\frac{m_C}{\Delta_A\Delta_B}\right]\right) \end{align}

The reason for writing it this way will become clear later.

# Implementing Quantized Matrix Multiplication

My task is to run the model on a GPU, so I looked into CUDA 8.0, which is said to support int8/uint8 operations, though that support turns out to be weak and disappointing. Luckily, in cublas_api.h I found an API called cublasUint8gemmBias.
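The declaration is roughly the following (reconstructed from the header from memory; treat the exact types and parameter order as an approximation):

```cpp
cublasStatus_t cublasUint8gemmBias(cublasHandle_t handle,
                                   cublasOperation_t transa,
                                   cublasOperation_t transb,
                                   cublasOperation_t transc,
                                   int m, int n, int k,
                                   const unsigned char *A, int A_bias, int lda,
                                   const unsigned char *B, int B_bias, int ldb,
                                   unsigned char *C,       int C_bias, int ldc,
                                   int C_mult, int C_shift);
```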

From the name we can tell it must compute a uint8 GEMM. Strangely, there is not a single word about it in the documentation, so I had to try, guess, and test by myself.

Initially I simply set all the *_bias parameters to 0, and confirmed that it does compute the uint8 matrix product, handling overflow and underflow by saturating to 255 and 0 respectively. After that, I did not investigate any deeper.

Things changed when I happened to read a blog post named How to Quantize Neural Networks with TensorFlow, which led me to a piece of TensorFlow code. From the file name quantized_matmul_op.cc we can tell it is used for matrix multiplication, and in that file I noticed one particular function.

It looks JUST LIKE the previous CUDA API; for example, it also has parameters named offset_*. I immediately had the sense that it, too, performs uint8 matrix multiplication. More importantly, from several papers by Google I knew that TensorFlow uses the same linear quantization scheme as I do. As a result, I began to re-examine the cublasUint8gemmBias API in CUDA. After some guessing and experimenting, I finally figured out the formula it computes:

$$C = 2^{-C_{shift}}\cdot C_{mult}\cdot ((A-A_{bias})(B-B_{bias})+C_{bias})$$

where $A$, $B$ and the result $C$ are all uint8 matrices, while the other parameters are ints. Comparing it to formula $(\ref{chat})$, we can read off the value of each variable:

\begin{aligned} A_{bias}&\triangleq-\left[\frac{m_A}{\Delta_A}\right]\\ B_{bias}&\triangleq-\left[\frac{m_B}{\Delta_B}\right]\\ C_{bias}&\triangleq-\left[\frac{m_C}{\Delta_A\Delta_B}\right]\\ 2^{-C_{shift}}\cdot C_{mult}&\triangleq\frac{\Delta_A\Delta_B}{\Delta_C} \end{aligned}

The first three formulas are clear, but the last one raises a question: how do we use two integers $a$ and $b$ to approximate a real number $r$ such that $2^{-a}b\approx r$? It turns out to be simple. Suppose we allow an error $\epsilon$; then:

$$| 2^{-a}b-r |<\epsilon \Rightarrow | b-2^ar|<2^a\epsilon$$

The left-hand side is the distance from the integer $b$ to the real number $2^a r$; choosing $b=[2^a r]$ makes it at most $1/2$, so the bound holds as soon as $2^a\epsilon\geq 1$. We can therefore choose $\epsilon=2^{-30}\approx 10^{-9}$, $a=30$, $b=[2^{30}r]$, which leads to

\begin{aligned} C_{shift}&=30\\ C_{mult}&=\left[2^{30} \frac{\Delta_A\Delta_B}{\Delta_C}\right] \end{aligned}
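In code, deriving all five parameters from the three quantization ranges might look like this (a sketch with my own helper names; the $[\cdot]$ rounding is done with std::lround):

```cpp
#include <cmath>

// Quantization range of one matrix: x ≈ x_hat * delta + min,
// with delta = (max - min) / 255 for uint8.
struct QuantParams {
  double min, max;
  double delta() const { return (max - min) / 255.0; }
};

// The five integer parameters of the cublasUint8gemmBias call.
struct GemmBiasParams {
  int A_bias, B_bias, C_bias, C_mult, C_shift;
};

// Apply the correspondence derived above. C_shift is fixed at 30, the
// largest value the API appears to accept.
GemmBiasParams derive_params(QuantParams qa, QuantParams qb, QuantParams qc) {
  GemmBiasParams p;
  p.A_bias = -int(std::lround(qa.min / qa.delta()));
  p.B_bias = -int(std::lround(qb.min / qb.delta()));
  p.C_bias = -int(std::lround(qc.min / (qa.delta() * qb.delta())));
  p.C_shift = 30;
  // C_mult = [2^30 * delta_A * delta_B / delta_C]
  p.C_mult = int(std::lround(std::ldexp(qa.delta() * qb.delta() / qc.delta(), 30)));
  return p;
}
```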

Now we have completely demystified this undocumented API!

One more note: it turns out that the parameter C_shift can be at most 30; with larger values the result is complete garbage.
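Putting the observations together, the API's behavior can be emulated on the CPU. The following is my own reference sketch, not the cuBLAS kernel; in particular, whether the final shift rounds or truncates is a guess:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// CPU emulation of C = clamp(2^{-C_shift} * C_mult * ((A-A_bias)(B-B_bias) + C_bias)).
// A is m x k, B is k x n, both row-major; the result C is m x n.
std::vector<uint8_t> uint8_gemm_ref(const std::vector<uint8_t>& A,
                                    const std::vector<uint8_t>& B,
                                    int m, int n, int k,
                                    int A_bias, int B_bias, int C_bias,
                                    int C_mult, int C_shift) {
  std::vector<uint8_t> C(m * n);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int64_t acc = C_bias;  // accumulate in wide integers to avoid overflow
      for (int t = 0; t < k; ++t)
        acc += int64_t(int(A[i * k + t]) - A_bias) * (int(B[t * n + j]) - B_bias);
      // 2^{-C_shift} * C_mult * acc, implemented as a multiply and a shift.
      int64_t scaled = (acc * C_mult) >> C_shift;
      // Saturate to uint8, matching the observed over/under-flow capping.
      C[i * n + j] = uint8_t(std::min<int64_t>(255, std::max<int64_t>(0, scaled)));
    }
  }
  return C;
}
```

This emulation reproduces the saturation behavior observed earlier, so it is handy for sanity-checking parameter choices against the GPU results.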

# Code

Below is the function that computes the product of two uint8 matrices.