# N-Dimensional Convolution with Activation

This example demonstrates an **N-dimensional convolution forward pass fused with an activation function**. The fusion applies the elementwise activation inside the convolution kernel instead of in a separate kernel, a pattern found in most convolutional neural networks. It saves one global-memory write and one read of the intermediate convolution result.

## Mathematical Formulation

The operation performs an N-dimensional convolution followed immediately by an activation function.

1.  **N-Dimensional Convolution**: A standard N-dimensional forward convolution.
    $C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$
    Where `In` is the input tensor, `W` is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.

2.  **Activation Function**: Apply an elementwise activation function to the convolution result.
    $\text{Out} = \text{Activation}(C_{temp})$
    Common activation functions include:
    - **ReLU**: $\text{ReLU}(x) = \max(0, x)$
    - **Sigmoid**: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
    - **Tanh**: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
    - **GELU**: $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard Gaussian CDF
    - **Swish**: $\text{Swish}(x) = x \cdot \text{Sigmoid}(x)$

The key optimization is that the intermediate tensor `C_temp` is **never written to global memory**. The activation function is applied directly to the convolution result held in registers.
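
As a reference for the formulation above, here is a minimal pure-Python sketch of a 2D case (stride 1, no padding, batch size 1). It is not the Composable Kernel implementation; the function name `conv2d_activ`, the nested-list layouts, and the sample values are assumptions made for illustration. The local variable `acc` stands in for the register accumulator, and the activation is applied to it before the single store to the output.

```python
def relu(x):
    return max(0.0, x)


def conv2d_activ(inp, wei, activation):
    """Direct 2D convolution (stride 1, no padding) with a fused activation.

    inp: nested lists indexed [C][Hi][Wi]
    wei: nested lists indexed [K][C][Y][X]
    Returns nested lists indexed [K][Ho][Wo].
    """
    C, Hi, Wi = len(inp), len(inp[0]), len(inp[0][0])
    K, Y, X = len(wei), len(wei[0][0]), len(wei[0][0][0])
    Ho, Wo = Hi - Y + 1, Wi - X + 1
    out = [[[0.0] * Wo for _ in range(Ho)] for _ in range(K)]
    for k in range(K):
        for ho in range(Ho):
            for wo in range(Wo):
                acc = 0.0  # stands in for the register accumulator
                for c in range(C):
                    for y in range(Y):
                        for x in range(X):
                            acc += inp[c][ho + y][wo + x] * wei[k][c][y][x]
                # The activation is applied to the accumulator, so the
                # pre-activation value (C_temp) is never stored.
                out[k][ho][wo] = activation(acc)
    return out


# Hypothetical sample data: one input channel, one 2x2 filter.
inp = [[[1.0, -2.0, 3.0],
        [-4.0, 5.0, -6.0],
        [7.0, -8.0, 9.0]]]
wei = [[[[1.0, 0.0],
         [0.0, 1.0]]]]

pre_activation = conv2d_activ(inp, wei, lambda v: v)  # plain convolution
fused = conv2d_activ(inp, wei, relu)                  # convolution + ReLU
print(pre_activation)
print(fused)
```

Passing the identity function as `activation` gives the plain convolution, which makes it easy to compare the fused result against a separate convolution-then-activation computation.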

## Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue

The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.

1.  **Implicit GEMM Core**: The convolution is transformed into an equivalent GEMM operation:
    -   **Input Transformation**: The input tensor is viewed through an im2col mapping that is computed on the fly from tensor indices; the im2col matrix itself is never materialized in memory.
    -   **Matrix Multiplication**: The core computation is performed as a tiled matrix multiplication.
    -   **Output Accumulation**: Results are accumulated in registers as standard GEMM tiles.

2.  **Fused Activation Epilogue**: Before storing results to global memory:
    -   **Elementwise Activation**: Apply the activation function to each element in the accumulated tile.
    -   **Vectorized Operations**: Use vectorized instructions where possible for activation computation.
    -   **Store Activated Result**: Write the final activated output directly to global memory.

This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.
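
The index mapping behind the implicit GEMM view can be sketched in plain Python. This is an illustrative model only, assuming a 2D convolution with stride 1, no padding, and batch size 1; the function name and the specific GEMM dimension ordering (`gemm_m = Ho * Wo`, `gemm_n = K`, `gemm_k = C * Y * X`) are assumptions, and the real kernel tiles these loops across workgroups and threads rather than running them serially.

```python
def conv2d_implicit_gemm_activ(inp, wei, activation):
    """Convolution expressed as a GEMM whose A operand is never materialized.

    inp: nested lists indexed [C][Hi][Wi]
    wei: nested lists indexed [K][C][Y][X]
    Returns nested lists indexed [K][Ho][Wo].
    """
    C, Hi, Wi = len(inp), len(inp[0]), len(inp[0][0])
    K, Y, X = len(wei), len(wei[0][0]), len(wei[0][0][0])
    Ho, Wo = Hi - Y + 1, Wi - X + 1

    gemm_m, gemm_n, gemm_k = Ho * Wo, K, C * Y * X
    out = [[[0.0] * Wo for _ in range(Ho)] for _ in range(K)]

    for m in range(gemm_m):
        ho, wo = divmod(m, Wo)                 # GEMM row -> output pixel
        for n in range(gemm_n):                # GEMM column -> output channel
            acc = 0.0
            for kk in range(gemm_k):
                c, rem = divmod(kk, Y * X)     # reduction index -> (c, y, x)
                y, x = divmod(rem, X)
                a = inp[c][ho + y][wo + x]     # A[m][kk], read via im2col indices
                b = wei[n][c][y][x]            # B[kk][n]
                acc += a * b
            # Fused epilogue: activate the accumulator, then store once.
            out[n][ho][wo] = activation(acc)
    return out


def relu(x):
    return max(0.0, x)


# Hypothetical sample data: two input channels, two output channels.
inp = [
    [[1.0, -2.0, 3.0], [-4.0, 5.0, -6.0], [7.0, -8.0, 9.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
wei = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, -10.0]]],
    [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]],
]
result = conv2d_implicit_gemm_activ(inp, wei, relu)
print(result)
```

The only difference from a direct convolution loop is how the indices are derived: each GEMM coordinate is decomposed into tensor coordinates when the element is read, which is what lets the kernel reuse a tiled GEMM pipeline without an im2col buffer.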

## Source Code Organization

-   [`convnd_activ_xdl.cpp`](./convnd_activ_xdl.cpp): The main example file. It sets up the N-dimensional input and weight tensors, specifies the activation function, and instantiates the `DeviceConvNdActiv` operation.
-   [`../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp`](../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp): The device interface for N-dimensional convolution with activation fusion.
-   The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./convnd_activ_xdl

# Positional arguments: verification (1 = on), data initialization method (2), kernel timing (1 = on)
./convnd_activ_xdl 1 2 1
```

## Applications

Convolution with activation fusion is fundamental to many neural network architectures.

-   **Convolutional Neural Networks (CNNs)**: Nearly every convolutional layer in a CNN is followed by an activation function, so the fusion applies throughout the network.
-   **Computer Vision Models**: Image classification, object detection, and segmentation networks all benefit from this fusion.
-   **3D CNNs**: Video analysis and medical imaging applications using 3D convolutions with activations.
-   **Mobile and Edge Deployment**: The reduced memory bandwidth makes this fusion especially valuable for resource-constrained environments.
-   **Training Acceleration**: Fewer kernel launches and less memory traffic shorten the forward pass of each training step. This example covers the forward pass only.

## Performance Benefits

This fusion provides several performance advantages:

-   **Reduced Memory Bandwidth**: Avoids one global-memory write and one read of the intermediate tensor.
-   **Improved Data Locality**: The convolution result stays in registers until the activation has been applied.
-   **Fewer Kernel Launches**: One kernel launch replaces two, which reduces launch overhead.
-   **Better Instruction Scheduling**: The activation arithmetic is part of the same kernel, so it can be interleaved with that kernel's memory operations.
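
To put a rough number on the bandwidth saving, the sketch below computes the global-memory traffic of the intermediate tensor that the fused kernel avoids. The layer sizes and the helper name `intermediate_traffic_bytes` are hypothetical and chosen only for illustration; actual savings depend on the problem size, data type, and cache behavior.

```python
def intermediate_traffic_bytes(batch, out_channels, out_spatial, bytes_per_element):
    """Bytes moved to write the intermediate tensor once and read it back once."""
    elements = batch * out_channels
    for length in out_spatial:
        elements *= length
    return 2 * elements * bytes_per_element  # one write + one read


# Hypothetical layer: batch 128, 256 output channels, 56x56 output, fp16.
saved = intermediate_traffic_bytes(128, 256, (56, 56), 2)
print(saved)             # bytes of avoided traffic
print(saved / 2**20)     # the same value in MiB
```

For this hypothetical layer the avoided traffic is 392 MiB per forward pass, which an unfused pipeline would spend moving a tensor that is only consumed by the activation kernel.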

## Activation Function Considerations

Different activation functions have different computational characteristics:

-   **ReLU**: Very fast, just a comparison and conditional assignment
-   **Sigmoid/Tanh**: Require expensive exponential calculations
-   **GELU**: Involves error function computation, typically approximated
-   **Swish**: Combines multiplication with sigmoid computation

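The memory traffic saved by fusion is the same for every activation, but the activation's arithmetic is added to the kernel epilogue. Cheap functions such as ReLU add almost no overhead, while functions that need exponentials or an error function lengthen the epilogue and can reduce the net gain.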
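
The relative cost of these functions is easier to see when they are written out. The following is a scalar Python sketch using only the standard `math` module, not the library's device-side operators; the function names are illustrative. It includes both the exact GELU (via the error function) and the widely used tanh-based approximation.

```python
import math


def relu(x):
    return max(0.0, x)  # one comparison, no transcendental functions


def sigmoid(x):
    # Branch on the sign so that math.exp never receives a large positive value.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    return math.tanh(x)


def gelu_exact(x):
    # x * Phi(x), with the Gaussian CDF written in terms of erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # Common tanh-based approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def swish(x):
    return x * sigmoid(x)


for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh),
                 ("gelu_exact", gelu_exact), ("gelu_tanh", gelu_tanh),
                 ("swish", swish)]:
    print(name, fn(1.0))
```

ReLU needs a single comparison, whereas every other function here calls at least one of `exp`, `tanh`, or `erf` per element, which is where the extra epilogue cost comes from.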