This is a short memo to record the formula I used in implementing convolution.

# Problem Parameters

The problem is formalized as:

- inputs
  - an $(N, C, H, W)$ tensor as the data
  - a $(K, C, R, S)$ tensor as the kernel

- output
  - an $(N, K, P, Q)$ tensor as the convolution (in fact correlation) result

- other parameters
  - $u,v$: the vertical and horizontal strides
  - $pad_h, pad_w$: the vertical and horizontal padding

$$P = \left\lceil\frac{H-R+1+2pad_h}{u}\right\rceil = \left\lfloor\frac{H-R+2pad_h}{u}\right\rfloor+1$$

while $Q$ is calculated similarly.
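As a quick sanity check, both closed forms of the output size can be computed and compared (a minimal Python sketch; the function name is illustrative):

```python
import math

def conv_out_size(in_size, kernel_size, stride, pad):
    """Output length along one spatial axis; the two closed forms agree."""
    ceil_form = math.ceil((in_size - kernel_size + 1 + 2 * pad) / stride)
    floor_form = (in_size - kernel_size + 2 * pad) // stride + 1
    assert ceil_form == floor_form
    return floor_form

# e.g. H = 5, R = 3, u = 2, pad_h = 1  ->  P = 3
```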

# Approach

We calculate the convolution in three steps:

## im2col

As demonstrated in the paper by Nvidia, we convert the $(N, C, H, W)$ data into a $(CRS, NPQ)$ matrix, and the $(K, C, R, S)$ kernel into a $(K, CRS)$ matrix. The latter is easy, since we only need to reshape by concatenating the last three dimensions. The former task is the so-called `im2col`.

The way I did it is to find the positions in the target $(CRS, NPQ)$ matrix to which the value at $(i, j)$ in the original data will be mapped, by the rule demonstrated in the paper's figure. Let's ignore the $N, C$ dimensions for a moment, and consider mapping an $(H,W)$ matrix into an $(RS, PQ)$ matrix. The final mapping is:

$$\label{im2col} \begin{align} \lambda: (i, j) \mapsto \left\{\left(i'S+j',\ \frac{i-i'}{u}Q+\frac{j-j'}{v}\right)\middle|\ i\equiv i'\ (\mathrm{mod}\ u),\ j\equiv j'\ (\mathrm{mod}\ v)\right\} \end{align}$$

where

$$\begin{aligned} &\max{(0,\ i+R-H-2pad_h)}\leqslant i'\leqslant \min{(R-1,\ i)}\\ &\max{(0,\ j+S-W-2pad_w)}\leqslant j'\leqslant \min{(S-1,\ j)} \end{aligned}$$

The idea behind the formula is: for each position $(i, j)$ of the (padded) image, if *a* mapped position in the target matrix is $(x,y)$, then

- $y$ is the index, in row-major order over the $P\times Q$ output positions, of the sliding window currently multiplying $(i,j)$.
- $x$ is the index of position $(i',j')$ in the kernel in row-major order.

Suppose the position *in the kernel* which is multiplying $(i,j)$ is $(i',j')$. Then the left-top corner of that window sits at $(i-i', j-j')$, which corresponds to output position $\left(\frac{i-i'}{u}, \frac{j-j'}{v}\right)$; after a little thinking we can figure out that

$$x = i'S+j',\qquad y = \frac{i-i'}{u}Q+\frac{j-j'}{v}$$

However, $(i',j')$ cannot take arbitrary values. Firstly, it must lie within $(R,S)$, the size of the kernel, i.e.

$$\begin{aligned} 0&\leqslant i'\leqslant R-1\\ 0&\leqslant j'\leqslant S-1 \end{aligned}$$

Besides, the *left-top position* of the kernel, i.e. $(i-i', j-j')$, must lie validly within the padded image, which implies:

$$\begin{aligned} 0&\leqslant i-i'\leqslant H+2pad_h-R\\ 0&\leqslant j-j'\leqslant W+2pad_w-S \end{aligned}$$

Furthermore, considering the strides $u,v$, we need $u$ and $v$ to divide $i-i'$ and $j-j'$ respectively.

Combining all of the above, we finally reach formula $(\ref{im2col})$, which is in fact also sufficient.
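The mapping $\lambda$ translates directly into code. Below is a pure-Python sketch of a single-channel `im2col` with zero padding, written for clarity rather than speed (not the actual GPU implementation; the function name is illustrative):

```python
def im2col_single(image, R, S, u, v, pad_h, pad_w):
    """Map an H x W image to an (R*S) x (P*Q) matrix via the mapping lambda."""
    H, W = len(image), len(image[0])
    P = (H - R + 2 * pad_h) // u + 1
    Q = (W - S + 2 * pad_w) // v + 1
    col = [[0] * (P * Q) for _ in range(R * S)]
    for i in range(H + 2 * pad_h):          # (i, j) indexes the padded image
        for j in range(W + 2 * pad_w):
            # value at padded position (i, j); zero outside the original image
            val = 0
            if pad_h <= i < pad_h + H and pad_w <= j < pad_w + W:
                val = image[i - pad_h][j - pad_w]
            # enumerate kernel positions (i', j') allowed by the constraints
            for ip in range(max(0, i + R - H - 2 * pad_h), min(R - 1, i) + 1):
                if (i - ip) % u:            # u must divide i - i'
                    continue
                for jp in range(max(0, j + S - W - 2 * pad_w), min(S - 1, j) + 1):
                    if (j - jp) % v:        # v must divide j - j'
                        continue
                    x = ip * S + jp
                    y = (i - ip) // u * Q + (j - jp) // v
                    col[x][y] = val
    return col
```

Summing each column of the result against an all-ones kernel row reproduces a plain sliding-window correlation, which is a convenient way to test the mapping.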

Based on this, we can take $(N,C)$ back to get the final formula:

$$(n, c, i, j) \mapsto (cRS,\ nPQ) + \lambda(i,j)$$

## gemm

An intuitive description of GEMM can be found here. We calculate the matrix product of the $(K, CRS)$ matrix with the $(CRS, NPQ)$ matrix to get a $(K, NPQ)$ matrix.

I used the uint8 gemm API provided by Nvidia, which is described in UInt8 Matrix Multiplication.
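The actual implementation calls Nvidia's uint8 GEMM; purely to illustrate the shapes involved in this step, here is a naive triple-loop matrix multiply in Python:

```python
def gemm(a, b):
    """Naive matrix multiply: (K x CRS) times (CRS x NPQ) -> (K x NPQ)."""
    k, m = len(a), len(a[0])
    n = len(b[0])
    out = [[0] * n for _ in range(k)]
    for i in range(k):
        for l in range(m):
            ail = a[i][l]
            for j in range(n):
                out[i][j] += ail * b[l][j]
    return out
```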

## transform

At last, we must turn the $(K, NPQ)$ matrix into an $(N, KPQ)$ matrix, and then into an $(N,K,P,Q)$ tensor. The first transform is essentially a matrix transpose if we view the innermost $PQ$ elements as a single block. Since those $PQ$ values are stored contiguously in memory, we can use `memcpy` to manually implement the block-wise matrix transpose. After that, a simple reshaping turns the result into the final $(N,K,P,Q)$ tensor.
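A Python sketch of this block-wise transpose, with list slicing standing in for the `memcpy` of each contiguous $PQ$ block (illustrative only; the function name is an assumption):

```python
def col2im_output(mat, N, K, P, Q):
    """View the K x (N*P*Q) result as a K x N grid of contiguous P*Q blocks,
    transpose the grid block-wise (the memcpy step), and reshape each block
    into its P x Q spatial layout, yielding an (N, K, P, Q) nested list."""
    out = [[[[0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for k in range(K):
        row = mat[k]
        for n in range(N):
            block = row[n * P * Q:(n + 1) * P * Q]  # one contiguous P*Q block
            for p in range(P):
                for q in range(Q):
                    out[n][k][p][q] = block[p * Q + q]
    return out
```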