Files
Aviral Goel 004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
..

N-Dimensional Convolution with Activation

This example demonstrates an N-dimensional convolution forward pass fused with an activation function. This fusion pattern combines the convolution operation with elementwise activation functions in a single kernel, which is extremely common in convolutional neural networks and provides significant performance benefits.

Mathematical Formulation

The operation performs an N-dimensional convolution followed immediately by an activation function.

  1. N-Dimensional Convolution: A standard N-dimensional forward convolution. C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W}) Where In is the input tensor, W is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.

  2. Activation Function: Apply an elementwise activation function to the convolution result. \text{Out} = \text{Activation}(C_{temp}) Common activation functions include:

    • ReLU: \text{ReLU}(x) = \max(0, x)
    • Sigmoid: \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
    • Tanh: \text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
    • GELU: \text{GELU}(x) = x \cdot \Phi(x) where \Phi is the standard Gaussian CDF
    • Swish: \text{Swish}(x) = x \cdot \text{Sigmoid}(x)

The key optimization is that the intermediate tensor C_temp is never written to global memory. The activation function is applied directly to the convolution result held in registers.

Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue

The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.

  1. Implicit GEMM Core: The convolution is transformed into an equivalent GEMM operation:

    • Input Transformation: The input tensor is implicitly transformed using the im2col operation.
    • Matrix Multiplication: The core computation is performed as a tiled matrix multiplication.
    • Output Accumulation: Results are accumulated in registers as standard GEMM tiles.
  2. Fused Activation Epilogue: Before storing results to global memory:

    • Elementwise Activation: Apply the activation function to each element in the accumulated tile.
    • Vectorized Operations: Use vectorized instructions where possible for activation computation.
    • Store Activated Result: Write the final activated output directly to global memory.

This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.

Source Code Organization

  • convnd_activ_xdl.cpp: The main example file. It sets up the N-dimensional input tensor, weight tensor, specifies the activation function, and instantiates the DeviceConvNdActiv operation.
  • ../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp: The device interface for N-dimensional convolution with activation fusion.
  • The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./convnd_activ_xdl

# Run with verification, data initialization, and timing
./convnd_activ_xdl 1 2 1

Applications

Convolution with activation fusion is fundamental to many neural network architectures.

  • Convolutional Neural Networks (CNNs): Nearly every convolutional layer in CNNs is followed by an activation function, making this fusion extremely valuable.
  • Computer Vision Models: Image classification, object detection, and segmentation networks all benefit from this fusion.
  • 3D CNNs: Video analysis and medical imaging applications using 3D convolutions with activations.
  • Mobile and Edge Deployment: The reduced memory bandwidth makes this fusion especially valuable for resource-constrained environments.
  • Training Acceleration: Reducing the number of kernel launches and memory operations accelerates both forward and backward passes during training.

Performance Benefits

This fusion provides several performance advantages:

  • Reduced Memory Bandwidth: Eliminates one full read/write cycle of the intermediate tensor
  • Improved Cache Locality: Data stays in cache/registers between convolution and activation
  • Fewer Kernel Launches: Reduces GPU kernel launch overhead
  • Better Instruction Scheduling: Allows better interleaving of compute and memory operations

Activation Function Considerations

Different activation functions have different computational characteristics:

  • ReLU: Very fast, just a comparison and conditional assignment
  • Sigmoid/Tanh: Require expensive exponential calculations
  • GELU: Involves error function computation, typically approximated
  • Swish: Combines multiplication with sigmoid computation

The choice of activation function can significantly impact the overall performance of the fused kernel, with simpler functions like ReLU providing the best performance improvements.