N-Dimensional Convolution with Activation
This example demonstrates an N-dimensional convolution forward pass fused with an activation function. The fusion combines the convolution with an elementwise activation in a single kernel, a pattern that follows nearly every convolutional layer in neural networks and avoids the memory traffic and kernel-launch overhead of running the two operations separately.
Mathematical Formulation
The operation performs an N-dimensional convolution followed immediately by an activation function.
- N-Dimensional Convolution: A standard N-dimensional forward convolution.

  $C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$

  where $\text{In}$ is the input tensor, $\text{W}$ is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.

- Activation Function: Apply an elementwise activation function to the convolution result.

  $\text{Out} = \text{Activation}(C_{temp})$

  Common activation functions include:
  - ReLU: $\text{ReLU}(x) = \max(0, x)$
  - Sigmoid: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
  - Tanh: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
  - GELU: $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard Gaussian CDF
  - Swish: $\text{Swish}(x) = x \cdot \text{Sigmoid}(x)$
The key optimization is that the intermediate tensor $C_{temp}$ is never written to global memory. The activation function is applied directly to the convolution result held in registers.
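As a concrete reference for the formulation above, the sketch below computes a plain 2D convolution followed by ReLU on the CPU. It is only a minimal illustration of the fused operation's semantics, not part of the example source; the NCHW/KCYX layouts, unit stride, and zero padding are assumptions chosen for brevity.

```cpp
#include <algorithm>
#include <vector>

// Reference semantics of Conv2D + ReLU (NCHW input, KCYX weights, stride 1, no padding).
std::vector<float> conv2d_relu_ref(const std::vector<float>& in,   // [N, C, Hi, Wi]
                                   const std::vector<float>& wei,  // [K, C, Y, X]
                                   int N, int C, int Hi, int Wi, int K, int Y, int X)
{
    const int Ho = Hi - Y + 1;
    const int Wo = Wi - X + 1;
    std::vector<float> out(static_cast<size_t>(N) * K * Ho * Wo);

    for(int n = 0; n < N; ++n)
        for(int k = 0; k < K; ++k)
            for(int ho = 0; ho < Ho; ++ho)
                for(int wo = 0; wo < Wo; ++wo)
                {
                    float c_temp = 0.f; // convolution accumulator
                    for(int c = 0; c < C; ++c)
                        for(int y = 0; y < Y; ++y)
                            for(int x = 0; x < X; ++x)
                                c_temp += in[((n * C + c) * Hi + ho + y) * Wi + wo + x] *
                                          wei[((k * C + c) * Y + y) * X + x];
                    // Activation is applied immediately to the accumulator.
                    out[((n * K + k) * Ho + ho) * Wo + wo] = std::max(0.f, c_temp);
                }
    return out;
}
```

The fused GPU kernel produces the same output, but never materializes c_temp as a tensor in memory; each accumulated value is activated and stored directly.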
Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue
The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.
- Implicit GEMM Core: The convolution is transformed into an equivalent GEMM operation:
  - Input Transformation: The input tensor is implicitly transformed using the im2col operation.
  - Matrix Multiplication: The core computation is performed as a tiled matrix multiplication.
  - Output Accumulation: Results are accumulated in registers as standard GEMM tiles.

- Fused Activation Epilogue: Before storing results to global memory:
  - Elementwise Activation: Apply the activation function to each element in the accumulated tile.
  - Vectorized Operations: Use vectorized instructions where possible for the activation computation.
  - Store Activated Result: Write the final activated output directly to global memory.
This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.
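The sketch below makes this structure explicit for the 2D case: the im2col gather happens implicitly through index arithmetic inside the GEMM loop, and the activation functor is applied to the register-resident accumulator right before the single store to global memory. It is a scalar, untiled simplification for illustration, not the library's kernel.

```cpp
#include <vector>

// Implicit GEMM view of Conv2D with a fused activation epilogue (stride 1, no padding).
// GEMM dimensions: GemmM = N*Ho*Wo (output pixels), GemmN = K (output channels), GemmK = C*Y*X.
template <typename Activation>
void conv2d_implicit_gemm_fused(const float* in,  // [N, C, Hi, Wi]
                                const float* wei, // [K, C, Y, X]
                                float* out,       // [N, K, Ho, Wo]
                                int N, int C, int Hi, int Wi, int K, int Y, int X,
                                Activation activation)
{
    const int Ho = Hi - Y + 1, Wo = Wi - X + 1;
    const int GemmM = N * Ho * Wo, GemmK = C * Y * X;

    for(int m = 0; m < GemmM; ++m)           // one "row" of the implicit A matrix
        for(int k_out = 0; k_out < K; ++k_out) // one "column" of the implicit B matrix
        {
            // Decompose the GEMM row index back into (n, ho, wo).
            const int n = m / (Ho * Wo), ho = (m / Wo) % Ho, wo = m % Wo;

            float acc = 0.f; // register accumulator, never spilled to global memory
            for(int kk = 0; kk < GemmK; ++kk)
            {
                // im2col performed implicitly via index arithmetic.
                const int c = kk / (Y * X), y = (kk / X) % Y, x = kk % X;
                acc += in[((n * C + c) * Hi + ho + y) * Wi + wo + x] *
                       wei[((k_out * C + c) * Y + y) * X + x];
            }
            // Fused epilogue: activation applied before the only global store.
            out[((n * K + k_out) * Ho + ho) * Wo + wo] = activation(acc);
        }
}
```

Passing a lambda such as `[](float x) { return x > 0.f ? x : 0.f; }` as the activation reproduces the Conv + ReLU reference shown earlier.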
Source Code Organization
- convnd_activ_xdl.cpp: The main example file. It sets up the N-dimensional input and weight tensors, specifies the activation function, and instantiates the DeviceConvNdActiv operation.
- ../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp: The device interface for N-dimensional convolution with activation fusion.
- The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.
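The host code in the example follows the argument/invoker pattern used by CK device operations. The self-contained mock below only illustrates that control flow; DeviceConvNdActivMock and its members are hypothetical stand-ins, not the actual DeviceConvNdActiv interface declared in device_convnd_activ.hpp.

```cpp
#include <cstdio>

// Hypothetical stand-in for a CK-style device operation. The real interface takes
// tensor pointers, lengths/strides, and element-wise operator instances.
struct DeviceConvNdActivMock
{
    struct Argument { /* pointers, lengths, strides, activation functor ... */ };
    struct Invoker
    {
        float Run(const Argument&) { return 0.f; } // returns elapsed time in ms
    };

    bool IsSupportedArgument(const Argument&) const { return true; }
    Argument MakeArgument() const { return {}; }
    Invoker MakeInvoker() const { return {}; }
};

int main()
{
    DeviceConvNdActivMock device_op;
    auto argument = device_op.MakeArgument();     // 1. describe the problem
    if(!device_op.IsSupportedArgument(argument))  // 2. check the instance supports it
        return 1;
    auto invoker  = device_op.MakeInvoker();      // 3. build an invoker
    float time_ms = invoker.Run(argument);        // 4. launch the kernel (optionally timed)
    std::printf("kernel time: %f ms\n", time_ms);
    return 0;
}
```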
Build and Run
Prerequisites
Ensure the Composable Kernel library is built and installed.
cd /path/to/composable_kernel/build
make -j install
Build the Example
cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
Run the Example
# Run the example with default settings
./convnd_activ_xdl
# Run with verification, data initialization, and timing
./convnd_activ_xdl 1 2 1
Applications
Convolution with activation fusion is fundamental to many neural network architectures.
- Convolutional Neural Networks (CNNs): Nearly every convolutional layer in CNNs is followed by an activation function, making this fusion extremely valuable.
- Computer Vision Models: Image classification, object detection, and segmentation networks all benefit from this fusion.
- 3D CNNs: Video analysis and medical imaging applications using 3D convolutions with activations.
- Mobile and Edge Deployment: The reduced memory bandwidth makes this fusion especially valuable for resource-constrained environments.
- Training Acceleration: Reducing the number of kernel launches and memory operations accelerates both forward and backward passes during training.
Performance Benefits
This fusion provides several performance advantages:
- Reduced Memory Bandwidth: Eliminates one full read/write cycle of the intermediate tensor
- Improved Cache Locality: Data stays in cache/registers between convolution and activation
- Fewer Kernel Launches: Reduces GPU kernel launch overhead
- Better Instruction Scheduling: Allows better interleaving of compute and memory operations
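To make the bandwidth point concrete, consider an illustrative fp16 output tensor; the dimensions below (N = 256, K = 256, Ho = Wo = 56) are assumptions chosen only for this back-of-the-envelope estimate. The unfused path writes the intermediate convolution result once and reads it back once for the activation, so fusion removes roughly

$$2 \times (256 \times 256 \times 56 \times 56) \times 2\,\text{bytes} \approx 0.8\,\text{GB}$$

of global-memory traffic per layer invocation.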
Activation Function Considerations
Different activation functions have different computational characteristics:
- ReLU: Very fast, just a comparison and conditional assignment
- Sigmoid/Tanh: Require expensive exponential calculations
- GELU: Involves error function computation, typically approximated
- Swish: Combines multiplication with sigmoid computation
The choice of activation function can significantly impact the overall performance of the fused kernel, with simpler functions like ReLU providing the best performance improvements.
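The cost differences come down to what each elementwise functor must do per element. The functors below are a generic C++ sketch for illustration, not the library's own elementwise operator definitions; they contrast a branch-free ReLU with the widely used tanh approximation of GELU.

```cpp
#include <cmath>

// Cheap: one compare/select per element.
struct Relu
{
    float operator()(float x) const { return x > 0.f ? x : 0.f; }
};

// More expensive: the tanh-based GELU approximation needs a tanh evaluation per element.
struct GeluTanhApprox
{
    float operator()(float x) const
    {
        constexpr float k = 0.7978845608f; // sqrt(2 / pi)
        return 0.5f * x * (1.f + std::tanh(k * (x + 0.044715f * x * x * x)));
    }
};
```

In the fused kernel both are applied to register-resident accumulators, so even the costlier functors add only extra ALU/transcendental work to the epilogue rather than extra memory traffic.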