N-Dimensional Convolution with Activation

This example demonstrates an N-dimensional convolution forward pass fused with an activation function. The fusion combines the convolution with an elementwise activation in a single kernel, a pattern that appears after nearly every convolutional layer in neural networks and that avoids a full round trip to global memory for the intermediate result.

Mathematical Formulation

The operation performs an N-dimensional convolution followed immediately by an activation function.

  1. N-Dimensional Convolution: A standard N-dimensional forward convolution: C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W}), where In is the input tensor, W is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.

  2. Activation Function: Apply an elementwise activation function to the convolution result: \text{Out} = \text{Activation}(C_{temp}). Common activation functions include:

    • ReLU: \text{ReLU}(x) = \max(0, x)
    • Sigmoid: \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
    • Tanh: \text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
    • GELU: \text{GELU}(x) = x \cdot \Phi(x) where \Phi is the standard Gaussian CDF
    • Swish: \text{Swish}(x) = x \cdot \text{Sigmoid}(x)

The key optimization is that the intermediate tensor C_{temp} is never written to global memory. The activation function is applied directly to the convolution result held in registers.
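
To make the fusion concrete, here is a minimal host-side sketch, assuming a 1D valid convolution with ReLU (the function name is illustrative, not part of the library). The accumulator plays the role of C_{temp}: the activation is applied before the single store, so the intermediate value never touches memory.

#include <algorithm>
#include <cstddef>
#include <vector>

// Reference: 1D valid convolution fused with ReLU in one pass.
std::vector<float> conv1d_relu(const std::vector<float>& in,
                               const std::vector<float>& w)
{
    const std::size_t out_len = in.size() - w.size() + 1;
    std::vector<float> out(out_len);
    for(std::size_t i = 0; i < out_len; ++i)
    {
        float acc = 0.f; // plays the role of C_{temp}
        for(std::size_t k = 0; k < w.size(); ++k)
            acc += in[i + k] * w[k];
        out[i] = std::max(0.f, acc); // activation fused before the store
    }
    return out;
}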

Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue

The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.

  1. Implicit GEMM Core: The convolution is transformed into an equivalent GEMM operation:

    • Input Transformation: The input tensor is implicitly transformed using the im2col operation.
    • Matrix Multiplication: The core computation is performed as a tiled matrix multiplication.
    • Output Accumulation: Results are accumulated in registers as standard GEMM tiles.
  2. Fused Activation Epilogue: Before storing results to global memory:

    • Elementwise Activation: Apply the activation function to each element in the accumulated tile.
    • Vectorized Operations: Use vectorized instructions where possible for activation computation.
    • Store Activated Result: Write the final activated output directly to global memory.

This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.
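
A sketch of the fused epilogue idea, assuming HIP and an elementwise functor with the (y, x) calling convention used by CK's elementwise operations (the Relu struct and store_tile function below are illustrative, not the library's actual implementation):

#include <hip/hip_runtime.h>

// Illustrative activation functor; CK's real functors live in
// ck::tensor_operation::element_wise.
struct Relu
{
    __host__ __device__ void operator()(float& y, const float& x) const
    {
        y = x > 0.f ? x : 0.f;
    }
};

// Hypothetical epilogue fragment: apply the functor to the accumulated
// tile in registers, then write the activated values once to global memory.
template <typename ActivOp, int TileSize>
__device__ void store_tile(float* dst, const float (&acc)[TileSize], ActivOp op)
{
    for(int i = 0; i < TileSize; ++i)
    {
        float y;
        op(y, acc[i]); // activation on the register value
        dst[i] = y;    // the only global-memory traffic
    }
}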

Source Code Organization

  • convnd_activ_xdl.cpp: The main example file. It sets up the N-dimensional input and weight tensors, specifies the activation function, and instantiates the DeviceConvNdActiv operation.
  • ../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp: The device interface for N-dimensional convolution with activation fusion.
  • The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.
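
The example follows the invoker pattern used throughout CK's examples; a condensed sketch is below. The concrete type alias DeviceOpInstance is assumed here, and the full MakeArgument() parameter list (tensor pointers, lengths, strides, and the elementwise ops including the activation) is defined by the header above.

// Condensed sketch of the standard CK invoker flow; arguments abridged.
auto conv     = DeviceOpInstance{};
auto invoker  = conv.MakeInvoker();
auto argument = conv.MakeArgument(/* tensors, lengths, strides, element ops */);

if(conv.IsSupportedArgument(argument))
{
    // Second StreamConfig field enables kernel timing; Run returns the
    // averaged kernel time in milliseconds.
    float ave_time = invoker.Run(argument, StreamConfig{nullptr, true});
}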

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./convnd_activ_xdl

# Run with verification, data initialization, and timing
# (args: do_verification=1, init_method=2, time_kernel=1)
./convnd_activ_xdl 1 2 1

Applications

Convolution with activation fusion is fundamental to many neural network architectures.

  • Convolutional Neural Networks (CNNs): Nearly every convolutional layer in CNNs is followed by an activation function, making this fusion extremely valuable.
  • Computer Vision Models: Image classification, object detection, and segmentation networks all benefit from this fusion.
  • 3D CNNs: Video analysis and medical imaging applications using 3D convolutions with activations.
  • Mobile and Edge Deployment: The reduced memory bandwidth makes this fusion especially valuable for resource-constrained environments.
  • Training Acceleration: Reducing the number of kernel launches and memory operations accelerates both forward and backward passes during training.

Performance Benefits

This fusion provides several performance advantages:

  • Reduced Memory Bandwidth: Eliminates one full read/write cycle of the intermediate tensor
  • Improved Cache Locality: Data stays in cache/registers between convolution and activation
  • Fewer Kernel Launches: Reduces GPU kernel launch overhead
  • Better Instruction Scheduling: Allows better interleaving of compute and memory operations

Activation Function Considerations

Different activation functions have different computational characteristics:

  • ReLU: Very fast, just a comparison and conditional assignment
  • Sigmoid/Tanh: Require expensive exponential calculations
  • GELU: Involves error function computation, typically approximated
  • Swish: Combines multiplication with sigmoid computation

The choice of activation function can significantly impact the overall performance of the fused kernel, with simpler functions like ReLU providing the best performance improvements.
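
As a rough illustration of these cost differences, the scalar sketches below (illustrative names, not CK's functors) show what each activation asks of the ALU per element; GELU is given in its common tanh approximation:

#include <cmath>

inline float relu(float x)    { return x > 0.f ? x : 0.f; }          // compare + select
inline float sigmoid(float x) { return 1.f / (1.f + std::exp(-x)); } // one exp, one divide
inline float swish(float x)   { return x * sigmoid(x); }             // sigmoid plus a multiply

// GELU via the widely used tanh approximation (avoids erf).
inline float gelu(float x)
{
    const float c = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.f + std::tanh(c * (x + 0.044715f * x * x * x)));
}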