
N-Dimensional Convolution with Activation

This example demonstrates an N-dimensional convolution forward pass fused with an activation function. Running the convolution and the elementwise activation in a single kernel avoids a round trip through global memory for the intermediate result. The pattern appears in most convolutional neural networks, where a convolution is usually followed directly by an activation.

Mathematical Formulation

The operation performs an N-dimensional convolution followed immediately by an activation function.

  1. N-Dimensional Convolution: A standard N-dimensional forward convolution, $C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$, where In is the input tensor, W is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.

  2. Activation Function: Apply an elementwise activation function to the convolution result, $\text{Out} = \text{Activation}(C_{temp})$. Common activation functions include:

    • ReLU: $\text{ReLU}(x) = \max(0, x)$
    • Sigmoid: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
    • Tanh: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
    • GELU: $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard Gaussian CDF
    • Swish: $\text{Swish}(x) = x \cdot \text{Sigmoid}(x)$

The key optimization is that the intermediate tensor C_temp is never written to global memory. The activation function is applied directly to the convolution result held in registers.
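
The formulation above can be sketched as a small host-side reference. This is a minimal illustration, not the Composable Kernel implementation: `conv1d_activ_ref` and the activation helpers are hypothetical names, and the convolution is assumed to be 1D, single-channel, stride 1, with no padding.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Elementwise activations from the list above (host-side reference versions).
inline float relu(float x) { return std::max(0.0f, x); }
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }
inline float tanh_act(float x) { return std::tanh(x); }
// GELU(x) = x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
inline float gelu(float x) { return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f))); }
inline float swish(float x) { return x * sigmoid(x); }

// Hypothetical reference (not a CK API): 1D convolution, single channel,
// stride 1, no padding, with the activation applied to each accumulator
// before the only store to the output.
template <typename Activation>
std::vector<float> conv1d_activ_ref(const std::vector<float>& in,
                                    const std::vector<float>& w,
                                    Activation act)
{
    assert(!w.empty() && in.size() >= w.size());
    const std::size_t out_len = in.size() - w.size() + 1;
    std::vector<float> out(out_len);
    for (std::size_t o = 0; o < out_len; ++o) {
        float acc = 0.0f; // plays the role of C_temp; never stored on its own
        for (std::size_t k = 0; k < w.size(); ++k)
            acc += in[o + k] * w[k];
        out[o] = act(acc); // Out = Activation(C_temp)
    }
    return out;
}
```

Calling `conv1d_activ_ref(in, w, relu)` with `in = {1, -2, 3, -4}` and `w = {1, 1}` accumulates `-1, 1, -1` and stores `0, 1, 0`. The local `acc` stands in for a register-resident accumulator.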

Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue

The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.

  1. Implicit GEMM Core: The convolution is transformed into an equivalent GEMM operation:

    • Input Transformation: The input tensor is read as if it had been rearranged by im2col, but the im2col matrix is never materialized; input addresses are computed from the GEMM indices on the fly.
    • Matrix Multiplication: The core computation is performed as a tiled matrix multiplication.
    • Output Accumulation: Results are accumulated in registers, one output tile at a time, as in a standard GEMM.
  2. Fused Activation Epilogue: Before storing results to global memory:

    • Elementwise Activation: Apply the activation function to each element in the accumulated tile.
    • Vectorized Operations: Use vectorized instructions where possible for activation computation.
    • Store Activated Result: Write the final activated output directly to global memory.

This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.
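
The index mapping behind the implicit GEMM and the fused epilogue can be sketched on the host as follows. This is an illustrative, untiled reference under stated assumptions: `ConvShape` and `conv2d_implicit_gemm_activ` are hypothetical names (not CK types), the layout is assumed to be NHWC input with KYXC weights, and stride 1 with no padding is assumed. A real kernel tiles these loops across workgroups and uses matrix instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical problem description: NHWC input, KYXC weights, stride 1, no padding.
struct ConvShape {
    int N, Hi, Wi, C; // input batch, height, width, channels
    int K, Y, X;      // K filters, each of size Y x X x C
    int Ho() const { return Hi - Y + 1; }
    int Wo() const { return Wi - X + 1; }
};

// Implicit GEMM view: M = N*Ho*Wo, N_gemm = K, K_gemm = Y*X*C.
// A(m, k) is fetched from the input by index arithmetic, so the im2col
// matrix is never built. The output is written in NHWK layout.
template <typename Activation>
std::vector<float> conv2d_implicit_gemm_activ(const ConvShape& s,
                                              const std::vector<float>& in,
                                              const std::vector<float>& wei,
                                              Activation act)
{
    const int Ho = s.Ho(), Wo = s.Wo();
    const int gemm_m = s.N * Ho * Wo, gemm_n = s.K, gemm_k = s.Y * s.X * s.C;
    assert(in.size() == std::size_t(s.N) * s.Hi * s.Wi * s.C);
    assert(wei.size() == std::size_t(gemm_n) * gemm_k);

    std::vector<float> out(std::size_t(gemm_m) * gemm_n);
    for (int m = 0; m < gemm_m; ++m) {
        // Decompose the GEMM row index into an output position.
        const int n = m / (Ho * Wo), ho = (m / Wo) % Ho, wo = m % Wo;
        for (int k_out = 0; k_out < gemm_n; ++k_out) {
            float acc = 0.0f; // accumulator, register-resident in a real kernel
            for (int k = 0; k < gemm_k; ++k) {
                // Decompose the reduction index into a filter tap and channel.
                const int y = k / (s.X * s.C), x = (k / s.C) % s.X, c = k % s.C;
                const std::size_t in_idx =
                    ((std::size_t(n) * s.Hi + (ho + y)) * s.Wi + (wo + x)) * s.C + c;
                acc += in[in_idx] * wei[std::size_t(k_out) * gemm_k + k];
            }
            out[std::size_t(m) * gemm_n + k_out] = act(acc); // fused epilogue
        }
    }
    return out;
}
```

The only write to `out` happens after `act(acc)`, which is the point of the fusion: no unactivated value ever leaves the accumulator.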

Source Code Organization

  • convnd_activ_xdl.cpp: The main example file. It sets up the N-dimensional input tensor, weight tensor, specifies the activation function, and instantiates the DeviceConvNdActiv operation.
  • ../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp: The device interface for N-dimensional convolution with activation fusion.
  • The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./convnd_activ_xdl

# Run with verification on (1), data initialization mode 2, and kernel timing on (1)
./convnd_activ_xdl 1 2 1

Applications

Convolution with activation fusion is fundamental to many neural network architectures.

  • Convolutional Neural Networks (CNNs): Most convolutional layers are followed directly by an activation function, so the fusion applies to nearly every layer.
  • Computer Vision Models: Image classification, object detection, and segmentation networks all benefit from this fusion.
  • 3D CNNs: Video analysis and medical imaging applications using 3D convolutions with activations.
  • Mobile and Edge Deployment: The reduced memory bandwidth makes this fusion especially valuable for resource-constrained environments.
  • Training Acceleration: Fewer kernel launches and memory operations shorten the forward pass of every training step.

Performance Benefits

This fusion provides several performance advantages:

  • Reduced Memory Bandwidth: Eliminates one write and one read of the intermediate tensor in global memory
  • Improved Data Locality: Accumulators stay in registers between the convolution and the activation
  • Fewer Kernel Launches: One launch instead of two removes the launch overhead of a separate activation kernel
  • Better Instruction Scheduling: Allows better interleaving of compute and memory operations
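
The bandwidth saving in the first bullet can be estimated with simple arithmetic. The helper below is a hypothetical sketch (`intermediate_traffic_bytes` is not part of Composable Kernel) and assumes an unfused pipeline that writes the intermediate tensor once and reads it back once.

```cpp
#include <cassert>
#include <cstdint>

// Global-memory traffic that an unfused pipeline spends on the intermediate
// tensor and a fused kernel avoids: the convolution writes C_temp once and
// the separate activation kernel reads it back once.
inline std::uint64_t intermediate_traffic_bytes(std::uint64_t num_output_elements,
                                                std::uint64_t bytes_per_element)
{
    assert(bytes_per_element > 0);
    return 2 * num_output_elements * bytes_per_element; // one write + one read
}
```

For an assumed output of shape N=32, K=64, Ho=Wo=56 in a 16-bit type, the tensor has 32 * 64 * 56 * 56 = 6,422,528 elements, so `intermediate_traffic_bytes(6422528, 2)` gives 25,690,112 bytes (about 25.7 MB) of traffic avoided per call.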

Activation Function Considerations

Different activation functions have different computational characteristics:

  • ReLU: Very cheap, a single max(0, x) per element
  • Sigmoid/Tanh: Require exponential evaluations, which cost considerably more than ReLU
  • GELU: Involves the error function, which is typically replaced by an approximation such as a tanh-based formula
  • Swish: Adds one multiplication on top of the sigmoid computation

The choice of activation function affects the cost of the fused epilogue. The memory traffic saved by fusion is the same for every activation, but simpler functions such as ReLU add the least compute on top of the convolution.
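
As an illustration of the GELU point, the sketch below compares the erf-based definition with the widely used tanh approximation. The function names are hypothetical, and nothing here implies which form the example's kernel uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Exact form: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
inline float gelu_erf(float x)
{
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
inline float gelu_tanh(float x)
{
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Largest absolute difference between the two forms over [lo, hi],
// sampled at `steps` evenly spaced points.
inline float max_gelu_gap(float lo, float hi, int steps)
{
    assert(steps > 1 && hi > lo);
    float worst = 0.0f;
    for (int i = 0; i < steps; ++i) {
        const float x = lo + (hi - lo) * float(i) / float(steps - 1);
        worst = std::max(worst, std::fabs(gelu_erf(x) - gelu_tanh(x)));
    }
    return worst;
}
```

Sampling `max_gelu_gap(-6.0f, 6.0f, 1201)` shows the two forms staying within 1e-3 of each other on that range, which is why the approximation is often acceptable when the error function is too costly.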