# N-Dimensional Convolution with Activation
This example demonstrates an **N-dimensional convolution forward pass fused with an activation function**. The convolution and the elementwise activation run in a single kernel, so the convolution result never makes a round trip through global memory before it is activated. The pattern applies to most convolutional neural networks, where a convolution layer is typically followed directly by an activation.
## Mathematical Formulation
The operation performs an N-dimensional convolution followed immediately by an activation function.
1. **N-Dimensional Convolution**: A standard N-dimensional forward convolution.
$C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$
Where `In` is the input tensor, `W` is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.
2. **Activation Function**: Apply an elementwise activation function to the convolution result.
$\text{Out} = \text{Activation}(C_{temp})$
Common activation functions include:
- **ReLU**: $\text{ReLU}(x) = \max(0, x)$
- **Sigmoid**: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
- **Tanh**: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- **GELU**: $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard Gaussian CDF
- **Swish**: $\text{Swish}(x) = x \cdot \text{Sigmoid}(x)$
The key optimization is that the intermediate tensor `C_temp` is **never written to global memory**. The activation function is applied directly to the convolution result held in registers.
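The two steps above can be sketched on the host as a plain reference loop. This is a minimal illustration, not the kernel used by this example: the function name `conv1d_relu_reference` is hypothetical, and the sketch assumes a single-channel 1D convolution with stride 1 and no padding, with ReLU as the activation. As in most deep learning libraries, "convolution" here means cross-correlation (the filter is not flipped).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of the Composable Kernel API.
// Computes a 1D convolution (stride 1, no padding, single channel) and
// applies ReLU to each accumulated value before storing it.
std::vector<float> conv1d_relu_reference(const std::vector<float>& in,
                                         const std::vector<float>& w)
{
    if(w.empty() || in.size() < w.size())
        return {};

    const std::size_t out_len = in.size() - w.size() + 1;
    std::vector<float> out(out_len);

    for(std::size_t o = 0; o < out_len; ++o)
    {
        // c_temp plays the role of C_temp: it exists only as a local value.
        float c_temp = 0.0f;
        for(std::size_t x = 0; x < w.size(); ++x)
            c_temp += in[o + x] * w[x];

        // Fused activation: applied before the single store to `out`.
        out[o] = std::max(0.0f, c_temp);
    }
    return out;
}
```

The only array the loop writes is `out`; there is no buffer holding the pre-activation values, which mirrors how the fused kernel avoids storing `C_temp`.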
## Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue
The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.
1. **Implicit GEMM Core**: The convolution is transformed into an equivalent GEMM operation:
   - **Input Transformation**: The input tensor is read as if it had been rearranged by im2col, but the im2col matrix is never materialized; the address of each element is computed on the fly.
   - **Matrix Multiplication**: The core computation is performed as a tiled matrix multiplication.
   - **Output Accumulation**: Results are accumulated in registers as standard GEMM tiles.
2. **Fused Activation Epilogue**: Before storing results to global memory:
   - **Elementwise Activation**: Apply the activation function to each element in the accumulated tile.
   - **Vectorized Operations**: Use vectorized instructions where possible for activation computation.
   - **Store Activated Result**: Write the final activated output directly to global memory.
This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.
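The index mapping behind the implicit GEMM can be sketched on the host as follows. This is an untiled, single-threaded illustration under stated assumptions, not the actual kernel: `Conv2dProblem` and `conv2d_relu_implicit_gemm` are hypothetical names, and the sketch assumes a 2D convolution with stride 1, no padding, NHWC input, KYXC weights, NHWK output, and ReLU as the activation. The GEMM dimensions are M = N·Ho·Wo, N = K, and K = Y·X·C.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical problem description for this sketch (not a CK type).
// Assumes Hi >= Y and Wi >= X.
struct Conv2dProblem
{
    std::size_t N, Hi, Wi, C; // input:  NHWC
    std::size_t K, Y, X;      // weight: KYXC
};

std::vector<float> conv2d_relu_implicit_gemm(const Conv2dProblem& p,
                                             const std::vector<float>& in,
                                             const std::vector<float>& wei)
{
    const std::size_t Ho = p.Hi - p.Y + 1;
    const std::size_t Wo = p.Wi - p.X + 1;

    const std::size_t gemm_m = p.N * Ho * Wo;   // one row per output pixel
    const std::size_t gemm_n = p.K;             // one column per output channel
    const std::size_t gemm_k = p.Y * p.X * p.C; // reduction length

    std::vector<float> out(gemm_m * gemm_n); // NHWK, i.e. row-major M x N

    for(std::size_t m = 0; m < gemm_m; ++m)
    {
        // Decompose the GEMM row index into (n, ho, wo).
        const std::size_t n  = m / (Ho * Wo);
        const std::size_t ho = (m / Wo) % Ho;
        const std::size_t wo = m % Wo;

        for(std::size_t k_out = 0; k_out < gemm_n; ++k_out)
        {
            float acc = 0.0f;
            for(std::size_t kk = 0; kk < gemm_k; ++kk)
            {
                // Decompose the reduction index into (y, x, c).
                const std::size_t y = kk / (p.X * p.C);
                const std::size_t x = (kk / p.C) % p.X;
                const std::size_t c = kk % p.C;

                // im2col element A[m][kk], read directly from the input;
                // the im2col matrix itself is never built.
                const float a = in[((n * p.Hi + ho + y) * p.Wi + wo + x) * p.C + c];
                // B[kk][k_out]: KYXC weights are contiguous in (y, x, c).
                const float b = wei[k_out * gemm_k + kk];
                acc += a * b;
            }
            // Fused epilogue: activate the accumulator, then store once.
            out[m * gemm_n + k_out] = std::max(0.0f, acc);
        }
    }
    return out;
}
```

A GPU kernel would compute tiles of `out` in parallel and keep each tile's accumulators in registers, but the index arithmetic and the position of the activation (between accumulation and the store) are the same idea.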
## Source Code Organization
- [`convnd_activ_xdl.cpp`](./convnd_activ_xdl.cpp): The main example file. It sets up the N-dimensional input tensor, weight tensor, specifies the activation function, and instantiates the `DeviceConvNdActiv` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp`](../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp): The device interface for N-dimensional convolution with activation fusion.
- The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.
## Build and Run
### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example
```bash
cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build
# CK_INSTALL_PATH must point to the prefix where Composable Kernel was installed
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
```
### Run the Example
```bash
# Run the example with default settings
./convnd_activ_xdl
# Arguments, in order: verification (1), data initialization mode (2), timing (1)
./convnd_activ_xdl 1 2 1
```
## Applications
Convolution with activation fusion is fundamental to many neural network architectures.
- **Convolutional Neural Networks (CNNs)**: Nearly every convolutional layer in a CNN is followed by an activation function, so the fusion applies to almost every layer.
- **Computer Vision Models**: Image classification, object detection, and segmentation networks all benefit from this fusion.
- **3D CNNs**: Video analysis and medical imaging applications using 3D convolutions with activations.
- **Mobile and Edge Deployment**: The reduced memory bandwidth makes this fusion especially valuable for resource-constrained environments.
- **Training Acceleration**: This example fuses the forward pass only; during training it reduces the kernel launches and memory traffic of the forward portion of each step.
## Performance Benefits
This fusion provides several performance advantages:
- **Reduced Memory Bandwidth**: Eliminates one full read/write cycle of the intermediate tensor
- **Improved Data Locality**: The convolution result stays in registers until the activation has been applied
- **Fewer Kernel Launches**: Reduces GPU kernel launch overhead
- **Better Instruction Scheduling**: With both operations in one kernel, the activation arithmetic can be interleaved with the epilogue's memory operations
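As a rough worked example of the bandwidth saving, assume (these sizes are illustrative, not taken from the example) an output tensor with batch $N = 32$, $K = 64$ output channels, spatial size $H_o = W_o = 56$, and 2-byte (FP16) elements. An unfused pipeline writes `C_temp` once and reads it back once, so the traffic removed by fusion is:
$\text{Bytes}_{\text{saved}} = 2 \cdot N \cdot K \cdot H_o \cdot W_o \cdot 2\,\text{B} = 2 \cdot 32 \cdot 64 \cdot 56 \cdot 56 \cdot 2\,\text{B} = 25{,}690{,}112\,\text{B} \approx 24.5\,\text{MiB}$
The saving scales linearly with the output size and is independent of which activation function is fused.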
## Activation Function Considerations
Different activation functions have different computational characteristics:
- **ReLU**: Very fast, just a comparison and conditional assignment
- **Sigmoid/Tanh**: Require expensive exponential calculations
- **GELU**: Involves error function computation, typically approximated
- **Swish**: Combines multiplication with sigmoid computation
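The memory traffic saved by fusion is the same for every activation, but the arithmetic added to the epilogue is not. Simple functions like ReLU add almost no work, while functions built on exponentials or the error function add more instructions per output element and can lengthen the epilogue.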
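The relative cost of these functions is easier to see when they are written out. The functors below are a hypothetical host-side sketch for illustration, not the operator types used by Composable Kernel; `apply_epilogue` is likewise an assumed helper name. The tanh-based GELU shown is a widely used approximation of the exact erf form.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical elementwise functors for illustration; not CK types.
struct Relu
{
    // One max operation.
    float operator()(float x) const { return std::max(0.0f, x); }
};

struct Sigmoid
{
    // One exponential and one division.
    float operator()(float x) const { return 1.0f / (1.0f + std::exp(-x)); }
};

struct Tanh
{
    float operator()(float x) const { return std::tanh(x); }
};

struct Swish
{
    // Sigmoid plus one extra multiplication.
    float operator()(float x) const { return x * Sigmoid{}(x); }
};

struct GeluExact
{
    // x * Phi(x), with the Gaussian CDF written in terms of erf.
    float operator()(float x) const
    {
        return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
    }
};

struct GeluTanh
{
    // Tanh-based approximation of GELU.
    float operator()(float x) const
    {
        const float c = 0.7978845608f; // sqrt(2 / pi)
        return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
    }
};

// Apply an activation to one accumulator value, as an epilogue would
// do for every element of its output tile.
template <typename Activation>
float apply_epilogue(float acc, Activation act = Activation{})
{
    return act(acc);
}
```

Passing the activation as a template parameter lets the compiler inline it into the loop that stores the results, which is the same reason a fused kernel would take its activation as a compile-time parameter.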