mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 16:59:10 +00:00

README.md

N-Dimensional Convolution with Activation

This example demonstrates an N-dimensional convolution forward pass fused with an activation function. The convolution and the elementwise activation run in a single kernel, so the intermediate convolution result is never written to global memory and read back. Because most convolutional layers are followed by an activation, this fusion pattern applies widely in convolutional neural networks.

Mathematical Formulation

The operation performs an N-dimensional convolution followed immediately by an activation function.

N-Dimensional Convolution: A standard N-dimensional forward convolution.

$$C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$$

where In is the input tensor, W is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.

Activation Function: Apply an elementwise activation function to the convolution result.

$$\text{Out} = \text{Activation}(C_{temp})$$

Common activation functions include:
- ReLU: $\text{ReLU}(x) = \max(0, x)$
- Sigmoid: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
- Tanh: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- GELU: $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard Gaussian CDF
- Swish: $\text{Swish}(x) = x \cdot \text{Sigmoid}(x)$
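
As a rough illustration of the formulas above, the activations can be written as small host-side C++ functors. This is a minimal sketch: the struct names are hypothetical and are not taken from the Composable Kernel headers, and GELU is written with the identity $\Phi(x) = \frac{1}{2}\left(1 + \text{erf}(x / \sqrt{2})\right)$.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical host-side functors for illustration only; these names are
// assumptions and do not come from the Composable Kernel library.
struct Relu
{
    float operator()(float x) const { return std::max(0.0f, x); }
};

struct Sigmoid
{
    float operator()(float x) const { return 1.0f / (1.0f + std::exp(-x)); }
};

struct Tanh
{
    float operator()(float x) const { return std::tanh(x); }
};

// GELU(x) = x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
struct Gelu
{
    float operator()(float x) const
    {
        return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
    }
};

// Swish(x) = x * Sigmoid(x).
struct Swish
{
    float operator()(float x) const { return x * Sigmoid{}(x); }
};
```

Each functor is stateless and takes one element at a time, which is the shape an elementwise epilogue needs: it can be applied to every accumulator value independently.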

The key optimization is that the intermediate tensor $C_{temp}$ is never written to global memory. The activation function is applied directly to the convolution result held in registers.

Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue

The implementation uses the implicit GEMM algorithm for convolution with the activation function fused into the epilogue.

Implicit GEMM Core: The convolution is transformed into an equivalent GEMM operation:
- Input Transformation: The im2col view of the input tensor is never materialized in memory; the corresponding input addresses are computed on the fly as tiles are loaded.
- Matrix Multiplication: The core computation is performed as a tiled matrix multiplication.
- Output Accumulation: Results are accumulated in registers as standard GEMM tiles.
Fused Activation Epilogue: Before storing results to global memory:
- Elementwise Activation: Apply the activation function to each element in the accumulated tile.
- Vectorized Operations: Use vectorized instructions where possible for activation computation.
- Store Activated Result: Write the final activated output directly to global memory.

This approach eliminates the need for a separate activation kernel and the associated memory bandwidth for reading and writing the intermediate convolution result.
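
The strategy above can be sketched on the CPU. The following is a hypothetical reference, not the Composable Kernel implementation: it assumes a 2D convolution with NHWC input, KYXC weights, NHWK output, stride 1, no padding, and no dilation. The function name `conv2d_implicit_gemm` and the `Relu` functor are illustrative assumptions. The loop nest is a plain GEMM whose A-matrix elements are fetched through im2col index arithmetic, and the activation is applied to the accumulator before the single store.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical activation functor (illustrative name, not a CK type).
struct Relu
{
    float operator()(float x) const { return std::max(0.0f, x); }
};

// Hypothetical CPU reference, not the CK kernel. Layouts: input NHWC,
// weight KYXC, output NHWK; stride 1, no padding, no dilation.
template <typename Activation>
std::vector<float> conv2d_implicit_gemm(const std::vector<float>& in,
                                        const std::vector<float>& wei,
                                        int N, int H, int W, int C,
                                        int K, int Y, int X,
                                        Activation act)
{
    const int Ho = H - Y + 1;
    const int Wo = W - X + 1;
    const int gemm_m = N * Ho * Wo; // one GEMM row per output pixel
    const int gemm_n = K;           // one GEMM column per output channel
    const int gemm_k = Y * X * C;   // reduction over the filter window
    std::vector<float> out(static_cast<std::size_t>(gemm_m) * gemm_n);

    for(int m = 0; m < gemm_m; ++m)
    {
        // Decompose the GEMM row index into (n, ho, wo).
        const int n  = m / (Ho * Wo);
        const int ho = (m / Wo) % Ho;
        const int wo = m % Wo;
        for(int k = 0; k < gemm_n; ++k)
        {
            float acc = 0.0f; // stands in for a register accumulator
            for(int r = 0; r < gemm_k; ++r)
            {
                // im2col on the fly: map the reduction index to (y, x, c).
                const int y = r / (X * C);
                const int x = (r / C) % X;
                const int c = r % C;
                const std::size_t in_idx =
                    ((static_cast<std::size_t>(n) * H + (ho + y)) * W + (wo + x)) * C + c;
                const std::size_t wei_idx = static_cast<std::size_t>(k) * gemm_k + r;
                acc += in[in_idx] * wei[wei_idx];
            }
            // Fused epilogue: activate before the only store of this element.
            out[static_cast<std::size_t>(m) * gemm_n + k] = act(acc);
        }
    }
    return out;
}
```

Calling `conv2d_implicit_gemm(in, wei, 1, 3, 3, 1, 1, 2, 2, Relu{})` on a 3x3 single-channel input with one 2x2 filter returns four activated outputs. Passing the activation as a template parameter lets the compiler inline it into the store, which mirrors the idea of a templated epilogue.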

Source Code Organization

convnd_activ_xdl.cpp: The main example file. It sets up the N-dimensional input tensor, weight tensor, specifies the activation function, and instantiates the DeviceConvNdActiv operation.
../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp: The device interface for N-dimensional convolution with activation fusion.
The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

Build the Example

```bash
cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

Run the Example

```bash
# Run the example with default settings
./convnd_activ_xdl

# Run with verification, data initialization, and timing
./convnd_activ_xdl 1 2 1
```

Applications

Convolution with activation fusion is fundamental to many neural network architectures.

- Convolutional Neural Networks (CNNs): Nearly every convolutional layer in a CNN is followed by an activation function, so the fusion applies to most layers.
- Computer Vision Models: Image classification, object detection, and segmentation networks all benefit from this fusion.
- 3D CNNs: Video analysis and medical imaging models apply 3D convolutions followed by activations.
- Mobile and Edge Deployment: The reduced memory bandwidth makes this fusion especially valuable in resource-constrained environments.
- Training Acceleration: Fewer kernel launches and memory operations shorten the forward pass of each training step.

Performance Benefits

This fusion provides several performance advantages:

- Reduced Memory Bandwidth: Eliminates one full write and one full read of the intermediate tensor.
- Improved Cache Locality: Data stays in cache/registers between convolution and activation.
- Fewer Kernel Launches: Reduces GPU kernel launch overhead.
- Better Instruction Scheduling: Allows better interleaving of compute and memory operations.
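
To put a number on the memory-bandwidth point, the output-side traffic can be estimated with simple arithmetic. The sketch below is a hypothetical helper, not part of the example: it assumes an output tensor of shape N x K x Ho x Wo, counts only traffic for the convolution result, and ignores caching effects. An unfused pipeline writes $C_{temp}$, reads it back, and writes Out (three passes over the tensor), while the fused kernel writes Out once.

```cpp
#include <cstddef>

// Hypothetical helper for a back-of-the-envelope estimate; the names are
// illustrative assumptions and the model ignores caches and input traffic.
struct OutputTraffic
{
    std::size_t unfused_bytes; // write C_temp + read C_temp + write Out
    std::size_t fused_bytes;   // write Out only
};

inline OutputTraffic output_traffic(std::size_t n,
                                    std::size_t k,
                                    std::size_t ho,
                                    std::size_t wo,
                                    std::size_t bytes_per_element)
{
    const std::size_t tensor_bytes = n * k * ho * wo * bytes_per_element;
    return OutputTraffic{3 * tensor_bytes, tensor_bytes};
}
```

For an assumed FP16 output with N = 64, K = 256, and Ho = Wo = 56, the tensor occupies 102,760,448 bytes, so this model gives 308,281,344 bytes of output-side traffic without fusion and 102,760,448 bytes with it.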

Activation Function Considerations

Different activation functions have different computational characteristics:

- ReLU: Very fast, just a comparison and conditional assignment.
- Sigmoid/Tanh: Require expensive exponential calculations.
- GELU: Involves error function computation, typically approximated.
- Swish: Combines multiplication with sigmoid computation.

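The choice of activation function affects the cost of the fused epilogue. A cheap function such as ReLU adds almost no work per output element, so the fused kernel runs close to the speed of the plain convolution, while exponential-based functions add extra arithmetic to every output element.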
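
As an illustration of the GELU point, the sketch below compares the erf-based definition with the widely used tanh approximation $0.5\,x\,\left(1 + \tanh\left(\sqrt{2/\pi}\,(x + 0.044715\,x^3)\right)\right)$. The function names are hypothetical, and the source does not state which approximation, if any, the example kernel uses.

```cpp
#include <cmath>

// Exact GELU: x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
// Function names here are illustrative assumptions, not CK identifiers.
inline float gelu_erf(float x)
{
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// Tanh-based approximation of GELU, which replaces erf with tanh.
inline float gelu_tanh(float x)
{
    const float sqrt_2_over_pi = 0.7978845608f;
    const float inner = sqrt_2_over_pi * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}
```

The two functions agree to within about 2e-3 at the sample points x = -3, 1, and 2.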