* LWPCK-4043: Add GPU reference implementations for CK Tile convolution
This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).
Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability
The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.
* update gpu reference for ck tile grouped conv
* correct c++ 18 format
* Add GPU Reference Implementations for Old CK Convolution
This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.
Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
* Fixed BF16 NaN issue (use type_convert instead of static_cast)
* Fixed FP8/BF8 arithmetic (accumulate in float)
* Fixed uninitialized variables
* All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)
- Created backward data GPU reference (naive_conv_bwd_data.hpp)
* Implements input gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
* Implements weight gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Integrated with old CK examples
* Forward: 10 XDL examples now support do_verification=2
* Backward data: Integrated with example/17_convnd_bwd_data/
* Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
* Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)
Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types
Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints
Result: Old CK GPU references can replace CPU references for verification
with 50-100x performance improvement for large tensors.
* Apply clang-format to old CK GPU reference files
* Fix C++17 compatibility: use brace initialization for aggregate types
* add get_rtol, get_atl and consistency cout message
* Use triple bracket syntax for kernel launch per review feedback
Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.
All tests still passing after this change.
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode
- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)
* Use ConvDims structure to simplify GPU reference kernels
Replace 24 individual parameters with ConvDims structure per review feedback.
- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files
* Use get_block_id() and get_thread_id() helpers in CK Tile
Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.
Updated 3 CK Tile GPU reference kernels per review feedback.
* Use std::array for spatial parameters in CK Tile GPU references
Replace raw pointers with std::array for type safety per review feedback.
- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters
* Use NDimSpatial+3 for stride array sizes
Replace hardcoded [10] with [NDimSpatial+3] per review feedback.
Array sizes now correctly reflect actual dimensions needed.
* Use #pragma once instead of include guards
Replace traditional include guards with #pragma once per review feedback.
Updated 3 Old CK GPU reference headers.
* Fix element-wise operation output in Old CK GPU references
Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.
This ensures element-wise operations are correctly applied to output.
* Initialize element-wise operation variables
Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.
Updated backward data and backward weight kernels.
* Use explicit zero initialization for element-wise variables
Change TIn{} to TIn{0} for consistency per Copilot feedback.
All 3 kernels now use consistent zero initialization.
* Fix copyright headers to match existing style
- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range
Addresses consistency feedback.
* Rename GPU reference files: add _gpu suffix
* Refactor index calculations: use std::array and extract to helper functions
* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Grouped Convolution Backward Pass for Weights
This example demonstrates the backward weight pass for a grouped convolution, often denoted as grouped_conv_bwd_weight. This operation is essential for training neural networks that use grouped or depthwise convolutions, such as ResNeXt, MobileNets, and EfficientNets. Its purpose is to compute the gradient of the loss function with respect to the convolution's filter weights, which is then used by an optimizer (like SGD or Adam) to update the model's parameters.
Mathematical Formulation
The backward weight pass computes the gradient \frac{\partial L}{\partial W}, given the input tensor from the forward pass, In, and the gradient from the subsequent layer, dL/dOut.
For a single group g, the operation is mathematically equivalent to a convolution between the input tensor for that group, In_[g], and the output gradient tensor for that group, dL/dOut_[g].
\frac{\partial L}{\partial W_{[g]}} = \text{In}_{[g]} \star \frac{\partial L}{\partial \text{Out}_{[g]}}
This operation correlates the input activations with the output error signals to determine how each weight should be adjusted to reduce the overall loss. The total gradient dL/dW is the collection of gradients for all G groups.
Algorithmic Strategy: Implicit Grouped GEMM
This operation is a perfect candidate for the Grouped GEMM primitive. The convolution for each of the G groups is independently transformed into a GEMM problem, and all G GEMMs are executed in a single kernel launch.
For each group g:
-
Input to Columns (
im2col): The input tensorIn_[g]is logically unrolled into a matrixIn'_[g]. This is the sameim2coltransformation used in the forward pass. This matrix becomes the "A" matrix in the GEMM. -
Output Gradient Reshaping: The output gradient tensor
dL/dOut_[g]is logically reshaped into a matrix(dL/dOut)'_[g]. This matrix becomes the "B" matrix in the GEMM. -
Implicit Grouped GEMM: The weight gradient
dL/dW_[g]is computed by a single GEMM:(\text{dL/dW})'_{[g]} = (\text{dL/dOut})'_{[g]} \times (\text{In}'_{[g]})^T
The key to performance is that this is executed as a Grouped GEMM. The DeviceGroupedConvBwdWeight interface takes the G independent problems and maps them to a DeviceGroupedGemm kernel. This kernel schedules the G independent GEMMs across the GPU's compute units. The im2col transformation is performed implicitly; the GEMM kernel reads data directly from the original In and dL/dOut tensors in the correct pattern, avoiding the materialization of large intermediate matrices.
This approach is highly efficient as it leverages the task-parallel nature of the grouped convolution and the computational efficiency of highly optimized GEMM kernels.
Source Code Organization
grouped_conv_bwd_weight_xdl.cpp: The main example file. It sets up a grouped convolution problem and instantiates theDeviceGroupedConvBwdWeightoperation.../../include/ck/tensor_operation/gpu/device/device_grouped_conv_bwd_weight.hpp: The high-level device interface. It internally translates the grouped convolution problem into a set of arguments for theDeviceGroupedGemminterface.../../include/ck/tensor_operation/gpu/device/device_grouped_gemm.hpp: The underlying Grouped GEMM device interface that is called by the grouped convolution operator.../../library/include/ck/library/reference_tensor_operation/cpu/reference_grouped_conv_bwd_weight.hpp: A CPU reference implementation for verifying the correctness of the GPU kernel.
Build and Run
Prerequisites
Ensure the Composable Kernel library is built and installed.
cd /path/to/composable_kernel/build
make -j install
Build the Example
cd /path/to/composable_kernel/example/20_grouped_conv_bwd_weight
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
Run the Example
# Run the example with default settings
./grouped_conv_bwd_weight_xdl
# Run with verification, data initialization, and timing
./grouped_conv_bwd_weight_xdl 1 2 1
Importance in Modern CNNs
Grouped and depthwise convolutions are the cornerstone of many efficient, state-of-the-art CNN architectures.
- Parameter Efficiency: By not connecting every input channel to every output channel, grouped convolutions significantly reduce the number of weights in a layer, leading to smaller and faster models.
- Depthwise Separable Convolutions: Used in MobileNets, EfficientNets, and Xception, these layers factorize a standard convolution into a depthwise convolution (a grouped convolution with
G = C) and a pointwise convolution (1x1conv). The backward pass for the depthwise part requires an efficientgrouped_conv_bwd_weightimplementation. - ResNeXt: This architecture introduced the "cardinality" dimension, which is simply the number of groups in a grouped convolution, demonstrating that increasing the number of groups can be more effective than increasing layer depth or width.
An optimized grouped_conv_bwd_weight kernel is therefore not an exotic feature but a critical requirement for training a wide range of modern and efficient deep learning models.