UInt8 Matrix Multiplication

In this post I will show how to do uint8 matrix multiplication using a hidden API in CUDA 8.0.

Mathematics of Quantized Matrix Multiplication

Given two matrices $A, B$ over $\mathbb{R}$, we need to calculate their product in the quantized space $Q_U(\mathbb{R})$, where $U$ is the ring of uint8 integers.

Suppose $A$ is mapped to $(\hat{A}, m_A, M_A)$ and $B$ is mapped to $(\hat{B}, m_B, M_B)$; then in $\mathbb{R}$ we have:

$$\label{corg} \begin{align} \begin{split} C=AB&\approx (\hat{A}\Delta_A+m_A)(\hat{B}\Delta_B+m_B)\\ &=\Delta_A\Delta_B\left(\hat{A}+\frac{m_A}{\Delta_A}\right)\left(\hat{B}+\frac{m_B}{\Delta_B}\right)\\ &\approx\Delta_A\Delta_B\left(\hat{A}+\left[\frac{m_A}{\Delta_A}\right]\right)\left(\hat{B}+\left[\frac{m_B}{\Delta_B}\right]\right) \end{split} \end{align}$$

where the addition is performed element-wise. Note that the product of the two bracketed terms is an integer matrix product (although not in uint8)!
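For concreteness, here is a minimal host-side sketch of the linear quantization assumed above (my own helper, not from any library): it maps a float matrix $A$ with range $[m_A, M_A]$ to $\hat{A}$ in uint8, with step $\Delta_A=(M_A-m_A)/255$.

#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>

// Linearly quantize a float matrix into uint8: A ≈ A_hat * delta + minA.
// Returns delta; A_hat[i] = round((A[i] - minA) / delta), clamped to [0, 255].
float quantize_u8(const std::vector<float> &A, float minA, float maxA,
                  std::vector<uint8_t> &A_hat)
{
    float delta = (maxA - minA) / 255.0f;
    A_hat.resize(A.size());
    for (size_t i = 0; i < A.size(); ++i) {
        float q = std::round((A[i] - minA) / delta);
        A_hat[i] = (uint8_t)std::min(255.0f, std::max(0.0f, q));
    }
    return delta;
}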

Quantizing $C$ back to the quantized space, we have

$$\hat{C} = \left[\frac{C-m_C}{\Delta_C}\right]$$

Substituting formula $(\ref{corg})$ in, we get

$$\label{chat} \begin{align} \hat{C}&\triangleq\frac{\Delta_A\Delta_B}{\Delta_C}\left(\left(\hat{A}+\left[\frac{m_A}{\Delta_A}\right]\right)\left(\hat{B}+\left[\frac{m_B}{\Delta_B}\right]\right)-\frac{m_C}{\Delta_A\Delta_B}\right) \end{align}$$

The reason for writing it this way will become clear later.
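To make the structure of formula $(\ref{chat})$ concrete, here is a small CPU reference sketch (my own code; I assume the result is finally rounded and clamped back into the uint8 range): the inner matrix product uses integers only, and floating point is needed just for the final rescaling.

#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>

// Reference for formula (2): C_hat = s * ((A_hat + a)(B_hat + b) - m_C / (dA * dB)),
// with s = dA * dB / dC, a = round(m_A / dA), b = round(m_B / dB).
// A_hat is m x n, B_hat is n x k, C_hat is m x k, all row-major.
void quantized_matmul_ref(const std::vector<uint8_t> &A_hat,
                          const std::vector<uint8_t> &B_hat,
                          std::vector<uint8_t> &C_hat,
                          int m, int n, int k,
                          float minA, float dA, float minB, float dB,
                          float minC, float dC)
{
    const int a = (int)std::lround(minA / dA);
    const int b = (int)std::lround(minB / dB);
    const float s = dA * dB / dC;
    const float c = minC / (dA * dB);
    C_hat.assign((size_t)m * k, 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j) {
            long long acc = 0;                            // integer-only inner product
            for (int t = 0; t < n; ++t)
                acc += (long long)(A_hat[i * n + t] + a) * (B_hat[t * k + j] + b);
            float v = std::round(s * ((float)acc - c));   // one float rescale per output
            C_hat[i * k + j] = (uint8_t)std::min(255.0f, std::max(0.0f, v));
        }
}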

Implement Quantized Matrix Multiplication

My task is to run the model on a GPU, so I looked into CUDA 8.0, which is said to support int8/uint8 operations, although the support turns out to be weak and disappointing. Luckily, in cublas_api.h I found this API:

/*...*/ cublasUint8gemmBias (cublasHandle_t handle,
    cublasOperation_t transa, cublasOperation_t transb, cublasOperation_t transc,
    int m, int n, int k,
    const unsigned char *A, int A_bias, int lda,
    const unsigned char *B, int B_bias, int ldb,
    unsigned char *C, int C_bias, int ldc,
    int C_mult, int C_shift);

From the name we can tell it must be calculating a uint8 GEMM. Strangely, there is not a single word about it in the documentation, so I had to try, guess, and test by myself.

Initially I simply set all *_bias parameters to 0, and confirmed that it does calculate the uint8 matrix multiplication, handling overflow/underflow by clamping to 255 and 0. After that, I did not investigate any deeper.
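In other words, each output element seems to be saturated into the uint8 range rather than wrapped around. A tiny sketch of the clamping I observed (the helper name is mine):

#include <cstdint>

// Saturation apparently applied to every output element of cublasUint8gemmBias:
// negative results become 0, results above 255 become 255.
static inline uint8_t saturate_to_u8(long long v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}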

Things changed when I happened to read a blog post named How to Quantize Neural Networks with TensorFlow, which led me to a piece of TensorFlow code. From the file name quantized_matmul_op.cc we can tell it is used for matrix multiplication. In that file, I noticed a function call:

GemmlowpMultiply<false, false, false>(context, a_data, b_data, c_data,
                                      m, n, k, offset_a, offset_b,
                                      lda, ldb, ldc);

It looks JUST LIKE the previous CUDA API; for example, it also has parameters named offset_*. I immediately had a sense that it is also performing uint8 matrix multiplication. More importantly, from several papers by Google I know that it uses the same linear quantization scheme as I do. As a result, I began to re-examine the cublasUint8gemmBias API in CUDA. After some guesses and experiments, I finally figured out the formula it is calculating:

$$C = 2^{-C_{shift}}\cdot C_{mult}\cdot ((A-A_{bias})(B-B_{bias})+C_{bias})$$

where $A$, $B$ and the result $C$ are all in uint8, while the other parameters are int. Comparing it to formula $(\ref{chat})$, we can figure out the value of each parameter:

$$\begin{aligned} A_{bias}&\triangleq-\left[\frac{m_A}{\Delta_A}\right]\\ B_{bias}&\triangleq-\left[\frac{m_B}{\Delta_B}\right]\\ C_{bias}&\triangleq-\left[\frac{m_C}{\Delta_A\Delta_B}\right]\\ 2^{-C_{shift}}\cdot C_{mult}&\triangleq\frac{\Delta_A\Delta_B}{\Delta_C} \end{aligned}$$

The first three formulas are clear, while the last one raises a question: how can we use two integers $a$ and $b$ to approximate a real number $r$ so that $2^{-a}b\approx r$? It turns out to be simple. Suppose we allow an error of $\epsilon$; then we require:

$$| 2^{-a}b-r |<\epsilon \Rightarrow | b-2^ar|<2^a\epsilon$$

Rounding $2^ar$ to the nearest integer makes the left-hand side at most $1/2$, so the inequality can always be satisfied once $2^a\epsilon\geq 1$. We can therefore choose $\epsilon=2^{-30}\approx 10^{-9}$, $a=30$, $b=[2^ar]$, which leads to

$$\begin{aligned} C_{shift}&=30\\ C_{mult}&=\left[2^{30} \frac{\Delta_A\Delta_B}{\Delta_C}\right] \end{aligned}$$
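For example, if the scale happens to be $\frac{\Delta_A\Delta_B}{\Delta_C}\approx 0.0123$ (a made-up value), then $C_{mult}=\left[0.0123\cdot 2^{30}\right]=13207024$, and $2^{-30}\cdot 13207024\approx 0.0123$ with an error of about $4\times 10^{-10}$.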

Now we have completely demystified this undocumented API!

One more note: it turns out that the parameter C_shift can be at most 30; otherwise the result is complete garbage.

Code

The following function calculates the product of two uint8 matrices.

#include <cstdio>
#include <cstdint>
#include <cmath>
#include <climits>
#include <algorithm>
#include <cublas_v2.h>

// Computes C = A * B, where A is m x n, B is n x k and C is m x k,
// all stored as row-major uint8. [minA, maxA] and [minB, maxB] are the real
// ranges of A and B; the derived range of C is returned through minC / maxC.
void matrixMulGPU(
    cublasHandle_t handle,
    uint8_t *d_C, const uint8_t *d_A, const uint8_t *d_B,
    const int m, const int n, const int k,
    float minA, float maxA, float minB, float maxB,
    float &minC, float &maxC)
{
    printf("minA=%f maxA=%f minB=%f maxB=%f\n", minA, maxA, minB, maxB);
    float deltaA = (maxA - minA) / (float)UCHAR_MAX;
    float deltaB = (maxB - minB) / (float)UCHAR_MAX;
    printf("deltaA=%f deltaB=%f\n", deltaA, deltaB);
    // Bound the range of C: each output element is a sum of n products of
    // values taken from [minA, maxA] and [minB, maxB].
    minC = std::min(std::min(minA * maxB, maxA * minB), std::min(minA * minB, maxA * maxB)) * n;
    maxC = std::max(std::max(minA * maxB, maxA * minB), std::max(minA * minB, maxA * maxB)) * n;
    float deltaC = (maxC - minC) / (float)UCHAR_MAX;
    printf("minC=%f maxC=%f\n", minC, maxC);
    printf("deltaC=%f\n", deltaC);
    // Biases as derived above: A_bias = -[m_A/dA], B_bias = -[m_B/dB],
    // C_bias = -[m_C/(dA*dB)].
    int a_bias = -(int)std::round(minA / deltaA);
    int b_bias = -(int)std::round(minB / deltaB);
    int c_bias = -(int)std::round(minC / (deltaA * deltaB));
    printf("a_bias=%d b_bias=%d c_bias=%d\n", a_bias, b_bias, c_bias);
    // Fixed-point representation of dA*dB/dC as C_mult * 2^(-C_shift), with C_shift = 30.
    float scale = deltaA * deltaB / deltaC;
    printf("scale=%f\n", scale);
    int c_shift = 30;
    int c_mul = (int)std::round(scale * (float)(1 << c_shift));
    printf("c_mul=%d c_shift=%d\n", c_mul, c_shift);
    // cuBLAS expects column-major data while ours is row-major, so compute
    // C^T = B^T * A^T by swapping A/B and the dimensions.
    cublasUint8gemmBias(
        handle,
        CUBLAS_OP_N, CUBLAS_OP_N, CUBLAS_OP_N,
        k, m, n,
        d_B, b_bias, k,
        d_A, a_bias, n,
        d_C, c_bias, k,
        c_mul, c_shift
    );
}
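For completeness, here is a hypothetical driver showing how the function above might be called (the sizes, ranges, and the assumption that A is m x n, B is n x k, C is m x k row-major are mine; error checking is omitted):

#include <cstdint>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Declaration of the function defined above.
void matrixMulGPU(cublasHandle_t handle,
                  uint8_t *d_C, const uint8_t *d_A, const uint8_t *d_B,
                  const int m, const int n, const int k,
                  float minA, float maxA, float minB, float maxB,
                  float &minC, float &maxC);

int main()
{
    const int m = 64, n = 128, k = 32;                            // example sizes
    std::vector<uint8_t> h_A(m * n), h_B(n * k), h_C(m * k);
    float minA = -1.0f, maxA = 1.0f, minB = -2.0f, maxB = 2.0f;   // example ranges
    // ... fill h_A / h_B with already-quantized data here ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    uint8_t *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, h_A.size());
    cudaMalloc((void **)&d_B, h_B.size());
    cudaMalloc((void **)&d_C, h_C.size());
    cudaMemcpy(d_A, h_A.data(), h_A.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), h_B.size(), cudaMemcpyHostToDevice);

    float minC, maxC;  // filled in by matrixMulGPU
    matrixMulGPU(handle, d_C, d_A, d_B, m, n, k, minA, maxA, minB, maxB, minC, maxC);

    // Copy the quantized result back; dequantize with C ≈ C_hat * (maxC - minC) / 255 + minC.
    cudaMemcpy(h_C.data(), d_C, h_C.size(), cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cublasDestroy(handle);
    return 0;
}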