In this post I will show how to perform uint8 matrix multiplication using a hidden API in CUDA 8.0.
Mathematics of Quantized Matrix Multiplication
Given two matrices $A, B$ with entries in $\mathbb{R}$, we need to calculate their product in the quantized space $Q_U(\mathbb{R})$, where $U$ is the set of uint8 integers.
Suppose $A$ is mapped to $(\hat{A}, m_A, M_A)$ under linear quantization, i.e. $A\approx\hat{A}\Delta_A+m_A$ with step size $\Delta_A=(M_A-m_A)/255$, and similarly $B$ is mapped to $(\hat{B}, m_B, M_B)$. Then we have in $\mathbb{R}$:
$$\label{corg} \begin{align} \begin{split} C=AB&\approx (\hat{A}\Delta_A+m_A)(\hat{B}\Delta_B+m_B)\\ &=\Delta_A\Delta_B\left(\hat{A}+\frac{m_A}{\Delta_A}\right)\left(\hat{B}+\frac{m_B}{\Delta_B}\right)\\ &\approx\Delta_A\Delta_B\left(\hat{A}+\left[\frac{m_A}{\Delta_A}\right]\right)\left(\hat{B}+\left[\frac{m_B}{\Delta_B}\right]\right) \end{split} \end{align}$$where $[\cdot]$ denotes rounding to the nearest integer and the addition is performed elementwise. Note that the product of the two bracketed terms is an integer matrix product (although the entries are no longer uint8)!
Then, quantizing $C$ back into the quantized space, we have
$$\hat{C} = \left[\frac{C-m_C}{\Delta_C}\right]$$Plugging formula (1) in, we get
$$\label{chat} \begin{align} \hat{C}&\triangleq\frac{\Delta_A\Delta_B}{\Delta_C}\left(\left(\hat{A}+\left[\frac{m_A}{\Delta_A}\right]\right)\left(\hat{B}+\left[\frac{m_B}{\Delta_B}\right]\right)-\frac{m_C}{\Delta_A\Delta_B}\right) \end{align}$$The reason for writing it this way will become clear later.
Implement Quantized Matrix Multiplication
My task is to run a model on the GPU, so I looked into CUDA 8.0, which is said to support int8/uint8 operations, although that support turns out to be weak and disappointing. Luckily, in cublas_api.h I found an API:


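From memory of the CUDA 8.0 header, the declaration is roughly the following (treat the exact spelling as approximate):

```c
cublasStatus_t cublasUint8gemmBias(cublasHandle_t handle,
                                   cublasOperation_t transa,
                                   cublasOperation_t transb,
                                   cublasOperation_t transc,
                                   int m, int n, int k,
                                   const unsigned char *A, int A_bias, int lda,
                                   const unsigned char *B, int B_bias, int ldb,
                                   unsigned char *C,       int C_bias, int ldc,
                                   int C_mult, int C_shift);
```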
From the name we can tell it must be calculating a uint8 GEMM. Strangely, there is not a single word about it in the documentation, so I had to try, guess, and test by myself.
Initially I simply set all *_bias parameters to 0, and confirmed that it does calculate the uint8 matrix multiplication, handling overflow and underflow by saturating to 255 and 0. After that, I did not investigate any deeper.
Things changed when I happened to read a blog post named How to Quantize Neural Networks with TensorFlow, which led me to a piece of TensorFlow code. From the file name quantized_matmul_op.cc we can tell it is used for matrix multiplication. In that file, I noticed a function:


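As best I can reconstruct it (the exact signature in the TensorFlow source may differ), the function looks roughly like:

```cpp
// Reconstructed approximately from quantized_matmul_op.cc; note the
// offset_a / offset_b parameters.
template <class T1, class T2, class T3>
void GemmlowpMultiply(OpKernelContext* op_context, const T1* a_data,
                      const T2* b_data, T3* c_data, int m, int n, int k,
                      int offset_a, int offset_b, int lda, int ldb, int ldc);
```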
It looks JUST LIKE the previous CUDA API; for example, it also has parameters named offset_*. I immediately had the sense that it, too, performs uint8 matrix multiplication. What is more important, from several papers by Google I know that TensorFlow uses the same linear quantization as I do. As a result, I began to reexamine the cublasUint8gemmBias API in CUDA. After some guesses and experiments, I finally figured out the formula it calculates:
$$C = \left[\frac{C_{mult}}{2^{C_{shift}}}\left((A+A_{bias})(B+B_{bias})-C_{bias}\right)\right]$$where $A$, $B$ and the result $C$ are all in uint8, while the other variables are int. Comparing it to formula $(\ref{chat})$, we can figure out the value of each variable:
$$\begin{aligned} A_{bias}&\triangleq\left[\frac{m_A}{\Delta_A}\right]\\ B_{bias}&\triangleq\left[\frac{m_B}{\Delta_B}\right]\\ C_{bias}&\triangleq\frac{m_C}{\Delta_A\Delta_B}\\ 2^{-C_{shift}}\cdot C_{mult}&\triangleq\frac{\Delta_A\Delta_B}{\Delta_C} \end{aligned}$$The first three formulas are clear, while the last one raises an issue: how do we use two integers $a$ and $b$ to approximate a real number $r$ such that $2^{-a}b\approx r$? It turns out to be simple. Suppose we allow error $\epsilon$; we have:
$$\left|2^{-a}b-r\right| <\epsilon \Leftrightarrow \left|b-2^{a}r\right|<2^{a}\epsilon$$The left-hand side is the distance from the integer $b$ to $2^{a}r$, which we can make at most $1/2$ by rounding, so it suffices to have $2^{a}\epsilon\ge 1$. We can therefore choose $\epsilon=2^{-30}\approx 10^{-9}$, $a=30$, $b=[2^{a}r]$, which leads to
$$\begin{aligned} C_{shift}&=30\\ C_{mult}&=\left[2^{30} \frac{\Delta_A\Delta_B}{\Delta_C}\right] \end{aligned}$$Now we have completely demystified this undocumented API!
One more note: it turns out that the parameter C_shift can be at most 30; otherwise the result is complete garbage.
Code
The following is the function that calculates the product of two uint8 matrices.
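A sketch of such a wrapper, assuming cublasUint8gemmBias takes the bias, multiplier, and shift parameters described above, column-major layout, and no transposition (helper names are mine; untested):

```c
#include <math.h>
#include <cublas_v2.h>

/* Multiply two quantized uint8 matrices with cublasUint8gemmBias,
 * deriving the bias/scale arguments from the quantization ranges
 * (mX, MX) exactly as in the formulas above.  The cuBLAS signature
 * is my reconstruction of the undocumented CUDA 8.0 header. */
cublasStatus_t uint8_gemm(cublasHandle_t handle, int m, int n, int k,
                          const unsigned char *A, double mA, double MA,
                          const unsigned char *B, double mB, double MB,
                          unsigned char *C, double mC, double MC) {
    double dA = (MA - mA) / 255, dB = (MB - mB) / 255, dC = (MC - mC) / 255;
    int A_bias = (int)lround(mA / dA);
    int B_bias = (int)lround(mB / dB);
    int C_bias = (int)lround(mC / (dA * dB));
    int C_shift = 30;                               /* must not exceed 30 */
    int C_mult  = (int)lround(ldexp(dA * dB / dC, C_shift));
    return cublasUint8gemmBias(handle,
                               CUBLAS_OP_N, CUBLAS_OP_N, CUBLAS_OP_N,
                               m, n, k,
                               A, A_bias, m,          /* lda = m */
                               B, B_bias, k,          /* ldb = k */
                               C, C_bias, m,          /* ldc = m */
                               C_mult, C_shift);
}
```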

