Refactor and integrate CK GPU references into ckProfiler. - All convolution layouts and groupings supported for all three directions - Unit tests verifying GPU and CPU reference is the same - Support added to profiler (do_verification = 2 enables GPU reference) - One profiler-based test per direction changed to GPU reference to demonstrate usag Closes AICK-427
N-Dimensional Convolution Backward Pass for Data
This example demonstrates the backward data pass of an N-dimensional convolution, often denoted as conv_bwd_data. This operation is a crucial part of the backpropagation algorithm for training Convolutional Neural Networks (CNNs). Its purpose is to compute the gradient of the loss function with respect to the convolution's input data, which is then passed back to the preceding layer in the network.
Mathematical Formulation
The backward data pass computes the gradient \frac{\partial L}{\partial \text{In}}, given the gradient from the subsequent layer, \frac{\partial L}{\partial \text{Out}}, and the filter weights W used in the forward pass.
Let the forward convolution be defined as:
\text{Out} = \text{In} \star W
The backward data pass is mathematically equivalent to a "full" convolution between the output gradient tensor dL/dOut and the 180-degree rotated (or transposed and flipped) weight tensor W.
\frac{\partial L}{\partial \text{In}} = \frac{\partial L}{\partial \text{Out}} \star \text{rot180}(W)
This operation propagates the error signal from the output back to the input, weighted by the same filters that were used in the forward pass.
Algorithmic Strategy: Implicit GEMM
As with the forward pass, the most efficient way to implement the backward data pass on a GPU is to transform the convolution into a General Matrix-Matrix Multiplication (GEMM) problem.
-
Output Gradient Reshaping: The output gradient tensor
dL/dOutis logically reshaped into a matrixdL/dOut'of shape[K, (N*Ho*Wo)]. This becomes the "A" matrix in the GEMM. -
Weight Reshaping: The weight tensor
Wis logically reshaped into a matrixW'of shape[K, (C*Y*X)]. This becomes the "B" matrix in the GEMM. -
Implicit GEMM: The core computation is then formulated as a GEMM operation. However, the output of this GEMM is not a simple matrix; it's the
dL/dIntensor.(\text{dL/dIn})' = (W')^T \times (\text{dL/dOut})'The key insight is that this operation can be performed without explicitly forming the matrices. The GEMM kernel is designed to read from
dL/dOutandWand write its results directly to the appropriate locations in thedL/dIntensor. This process is sometimes referred to as an "implicitcol2im" (column-to-image), as it is the inverse of theim2coltransformation used in the forward pass.
This "implicit GEMM" approach is highly efficient. It avoids the massive memory and bandwidth overhead of materializing intermediate matrices, which is critical for performance.
Source Code Organization
conv_bwd_data_xdl.cpp: The main example file that defines the parameters for a 2D convolution and instantiates the genericDeviceConvNdBwdDatakernel to compute the input gradients.../../include/ck/tensor_operation/gpu/device/device_conv_bwd_data.hpp: The high-level device interface for the backward data convolution. It is templated on the dimensionality, layouts, and data types of the problem.../../include/ck/tensor_operation/gpu/grid/gridwise_gemm_implicit_gemm_v1r2_xdlops_nchw_kcyx_nkhw.hpp: An example of a specific grid-wise kernel that implements the implicit GEMM algorithm for the backward data pass. The library contains multiple such kernels optimized for different layouts and problem types, and theDeviceConvNdBwdDatainterface selects the most appropriate one.../../library/include/ck/library/reference_tensor_operation/cpu/reference_conv_bwd_data.hpp: A CPU reference implementation used to verify the correctness of the GPU kernel's output.
Build and Run
Prerequisites
Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.
Build the Example
cd /path/to/composable_kernel/example/17_convnd_bwd_data
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
Run the Example
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: run kernel # of times (>1)
#arg4: num_dim_spatial(1|2|3)
#arg5 to ...: N, K, C, [Z,] [Y,] X, [Di,] [Hi,] Wi, S[z,] [Sy,] Sx, [Dz,] [Dy,] Dx, [LeftPz,] [LeftPy,] LeftPx, [RightPy,] [RightPy,] RightPx
./bin/example_convnd_bwd_data_xdl 0 1 5
Result
in_n_c_hi_wi: dim 4, lengths {128, 128, 71, 71}, strides {645248, 1, 9088, 128}
wei_k_c_y_x: dim 4, lengths {256, 128, 3, 3}, strides {1152, 1, 384, 128}
out_n_k_ho_wo: dim 4, lengths {128, 256, 36, 36}, strides {331776, 1, 9216, 256}
arg.a_grid_desc_k0_m_k1_container_{128, 175232, 8}
arg.b_grid_desc_k0_n_k1_container_{128, 128, 8}
arg.c_grid_desc_m_n_container_{ 175232, 128}
arg.c_grid_desc_m0_n0_m1_n1_m2_m3_m4_n2_container_( 2738, 2, 2, 2, 4, 2 )
launch_and_time_kernel: grid_dim {1369, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
arg.a_grid_desc_k0_m_k1_container_{64, 175232, 8}
arg.b_grid_desc_k0_n_k1_container_{64, 128, 8}
arg.c_grid_desc_m_n_container_{ 175232, 128}
arg.c_grid_desc_m0_n0_m1_n1_m2_m3_m4_n2_container_( 2738, 2, 2, 2, 4, 2 )
launch_and_time_kernel: grid_dim {1369, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
arg.a_grid_desc_k0_m_k1_container_{64, 175232, 8}
arg.b_grid_desc_k0_n_k1_container_{64, 128, 8}
arg.c_grid_desc_m_n_container_{ 175232, 128}
arg.c_grid_desc_m0_n0_m1_n1_m2_m3_m4_n2_container_( 2738, 2, 2, 2, 4, 2 )
launch_and_time_kernel: grid_dim {1369, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
arg.a_grid_desc_k0_m_k1_container_{32, 175232, 8}
arg.b_grid_desc_k0_n_k1_container_{32, 128, 8}
arg.c_grid_desc_m_n_container_{ 175232, 128}
arg.c_grid_desc_m0_n0_m1_n1_m2_m3_m4_n2_container_( 2738, 2, 2, 2, 4, 2 )
launch_and_time_kernel: grid_dim {1369, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Perf: 1.40031 ms, 69.8734 TFlops, 179.037 GB/s
Relationship to Other Passes
The training of a single convolutional layer requires three distinct steps:
- Forward Pass (
conv_fwd): Computes the output feature maps.Out = In * W
- Backward Data Pass (
conv_bwd_data): Computes the gradient with respect to the input, propagating the error to the previous layer. This is the focus of the current example.dL/dIn = dL/dOut * rot180(W)
- Backward Weight Pass (
conv_bwd_weight): Computes the gradient with respect to the weights, which is needed for the weight update.dL/dW = In * dL/dOut
All three passes are critical for training a CNN, and all are typically implemented as high-performance implicit GEMM operations.