mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-18 12:00:07 +00:00
106 lines
5.5 KiB
Markdown
106 lines
5.5 KiB
Markdown
# N-Dimensional Convolution with Activation
|
|
|
|
This example demonstrates an **N-dimensional convolution forward pass fused with an activation function**. This fusion pattern combines the convolution operation with elementwise activation functions in a single kernel, which is extremely common in convolutional neural networks and provides significant performance benefits.
|
|
|
|
## Mathematical Formulation
|
|
|
|
The operation performs an N-dimensional convolution followed immediately by an activation function.
|
|
|
|
1. **N-Dimensional Convolution**: A standard N-dimensional forward convolution.
|
|
$C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$
|
|
Where `In` is the input tensor, `W` is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.
|
|
|
|
2. **Activation Function**: Apply an elementwise activation function to the convolution result.
|
|
$\text{Out} = \text{Activation}(C_{temp})$
|
|
Common activation functions include:
|
|
- **ReLU**: $\text{ReLU}(x) = \max(0, x)$
|
|
- **Sigmoid**: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
|
|
- **Tanh**: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
|
|
- **GELU**: $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard Gaussian CDF
|
|
- **Swish**: $\text{Swish}(x) = x \cdot \text{Sigmoid}(x)$
|
|
|
|
The key optimization is that the intermediate tensor `C_temp` is **never written to global memory**. The activation function is applied directly to the convolution result held in registers.
|
|
|
|
## Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue
|
|
|
|
The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.
|
|
|
|
1. **Implicit GEMM Core**: The convolution is transformed into an equivalent GEMM operation:
|
|
- **Input Transformation**: The input tensor is implicitly transformed using the im2col operation.
|
|
- **Matrix Multiplication**: The core computation is performed as a tiled matrix multiplication.
|
|
- **Output Accumulation**: Results are accumulated in registers as standard GEMM tiles.
|
|
|
|
2. **Fused Activation Epilogue**: Before storing results to global memory:
|
|
- **Elementwise Activation**: Apply the activation function to each element in the accumulated tile.
|
|
- **Vectorized Operations**: Use vectorized instructions where possible for activation computation.
|
|
- **Store Activated Result**: Write the final activated output directly to global memory.
|
|
|
|
This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.
|
|
|
|
## Source Code Organization
|
|
|
|
- [`convnd_activ_xdl.cpp`](./convnd_activ_xdl.cpp): The main example file. It sets up the N-dimensional input tensor, weight tensor, specifies the activation function, and instantiates the `DeviceConvNdActiv` operation.
|
|
- [`../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp`](../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp): The device interface for N-dimensional convolution with activation fusion.
|
|
- The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.
|
|
|
|
## Build and Run
|
|
|
|
### Prerequisites
|
|
Ensure the Composable Kernel library is built and installed.
|
|
```bash
|
|
cd /path/to/composable_kernel/build
|
|
make -j install
|
|
```
|
|
|
|
### Build the Example
|
|
```bash
|
|
cd /path/to/composable_kernel/example/62_convnd_activ
|
|
mkdir build && cd build
|
|
|
|
cmake \
|
|
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
|
|
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
|
|
..
|
|
|
|
make -j
|
|
```
|
|
|
|
### Run the Example
|
|
```bash
|
|
# Run the example with default settings
|
|
./convnd_activ_xdl
|
|
|
|
# Run with verification, data initialization, and timing
|
|
./convnd_activ_xdl 1 2 1
|
|
```
|
|
|
|
## Applications
|
|
|
|
Convolution with activation fusion is fundamental to many neural network architectures.
|
|
|
|
- **Convolutional Neural Networks (CNNs)**: Nearly every convolutional layer in CNNs is followed by an activation function, making this fusion extremely valuable.
|
|
- **Computer Vision Models**: Image classification, object detection, and segmentation networks all benefit from this fusion.
|
|
- **3D CNNs**: Video analysis and medical imaging applications using 3D convolutions with activations.
|
|
- **Mobile and Edge Deployment**: The reduced memory bandwidth makes this fusion especially valuable for resource-constrained environments.
|
|
- **Training Acceleration**: Reducing the number of kernel launches and memory operations accelerates both forward and backward passes during training.
|
|
|
|
## Performance Benefits
|
|
|
|
This fusion provides several performance advantages:
|
|
|
|
- **Reduced Memory Bandwidth**: Eliminates one full read/write cycle of the intermediate tensor
|
|
- **Improved Cache Locality**: Data stays in cache/registers between convolution and activation
|
|
- **Fewer Kernel Launches**: Reduces GPU kernel launch overhead
|
|
- **Better Instruction Scheduling**: Allows better interleaving of compute and memory operations
|
|
|
|
## Activation Function Considerations
|
|
|
|
Different activation functions have different computational characteristics:
|
|
|
|
- **ReLU**: Very fast, just a comparison and conditional assignment
|
|
- **Sigmoid/Tanh**: Require expensive exponential calculations
|
|
- **GELU**: Involves error function computation, typically approximated
|
|
- **Swish**: Combines multiplication with sigmoid computation
|
|
|
|
The choice of activation function can significantly impact the overall performance of the fused kernel, with simpler functions like ReLU providing the best performance improvements.
|