composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 00:39:02 +00:00

Files

Johannes Graner 3727d5220a [rocm-libraries] ROCm/rocm-libraries#5652 (commit 7dc7d1d)

[CK Conv] Wavelet gemm pipeline for bwd_weight convolution (#5652)

## Motivation

In the current CShuffleV3 backward weight kernel, the in-kernel
conv-to-GEMM transform generates significant INT32 VALU pressure per
MFMA instruction. On VALU-heavy shapes (e.g., G=1, 3×3, C=256), these
index computation ops compete with MFMA for VALU issue slots, creating a
bottleneck that cannot be resolved by pipeline prefetching alone.

This PR adds a wave-specialized ("wavelet") convolution backward weight
kernel that splits workgroup threads into two roles:
- **Load waves**: conv-to-GEMM address computation + global memory loads
+ LDS writes (all VALU/VMEM)
- **Math waves**: LDS reads + MFMA + CShuffle epilogue (no index
computation)

Placing the two instruction classes on different waves lets the VALU and
MFMA work execute on different hardware functional units without
contention.

## Technical Details

**Core kernel (new files):**
- `gridwise_gemm_xdl_waveletmodel_cshuffle_conv_v3.hpp` —
wave-specialized gridwise GEMM for conv bwd weight (2-way split: load +
math)
- `device_grouped_conv_bwd_weight_xdl_waveletmodel_cshuffle_v3.hpp` —
device op following CShuffleV3 patterns; `BlockSize =
TileMathThreadGroupSize` for MFMA wave assignment, `LaunchBlockSize =
TileLoad + TileMath` for kernel launch

**Wave pipeline (modified):**
- `gridwise_gemm_waveletmodel.hpp` — load/math wave pipeline structs
with `sched_group_barrier` scheduling hints to front-load VMEM reads
before address-advance VALU

**Two wave ratios:**
- **(4,4)**: 256 load + 256 math = 512 threads (8 waves). Best on large
shapes.
- **(4,2)**: 256 load + 128 math = 384 threads (6 waves). Best on small
shapes (fewer sync barriers, denser MFMA per math wave).
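
The thread-count arithmetic behind the two ratios, and the difference between `BlockSize` and `LaunchBlockSize`, can be sketched on the host as below. This is an illustrative sketch only: `WaveSplit`, `WaveRole`, `role_of`, and `kWaveSize` are hypothetical names that do not come from the CK sources, the wave size of 64 is inferred from "512 threads (8 waves)", and the ordering of load and math threads within the workgroup is an assumption.

```cpp
#include <cassert>

// Hypothetical illustration types; none of these names come from the CK sources.
constexpr int kWaveSize = 64; // inferred from the PR text: 512 threads = 8 waves

struct WaveSplit
{
    int load_threads; // address computation, global loads, LDS writes
    int math_threads; // LDS reads, MFMA, CShuffle epilogue

    // Size used for MFMA wave assignment (BlockSize in the PR text).
    constexpr int block_size() const { return math_threads; }
    // Size used for the kernel launch (LaunchBlockSize in the PR text).
    constexpr int launch_block_size() const { return load_threads + math_threads; }
    constexpr int total_waves() const { return launch_block_size() / kWaveSize; }
};

enum class WaveRole { Load, Math };

// Assumption: load threads take the low thread ids. The real kernel may
// order the two groups differently.
constexpr WaveRole role_of(const WaveSplit& s, int thread_id)
{
    return thread_id < s.load_threads ? WaveRole::Load : WaveRole::Math;
}

constexpr WaveSplit kRatio44{256, 256}; // the (4,4) ratio
constexpr WaveSplit kRatio42{256, 128}; // the (4,2) ratio

static_assert(kRatio44.launch_block_size() == 512, "(4,4) launches 512 threads");
static_assert(kRatio44.total_waves() == 8, "(4,4) is 8 waves");
static_assert(kRatio42.launch_block_size() == 384, "(4,2) launches 384 threads");
static_assert(kRatio42.total_waves() == 6, "(4,2) is 6 waves");
static_assert(kRatio42.block_size() == 128, "only math threads count toward BlockSize");
```

In this sketch a thread decides its role once from its id, which mirrors the idea that a wave is either a load wave or a math wave for the whole kernel.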

**Instance coverage (F16 and BF16 symmetric):**

| Ratio | Tiles | Layouts | ConvSpecs |
|-------|-------|---------|-----------|
| (4,4) | M128×N128, M64×N64, M128×N64, M64×N128 | 2D NHWGC, 3D NDHWGC | Default, Filter1x1Stride1Pad0 |
| (4,2) | M64×N64, M128×N64, M64×N128 | 2D NHWGC | Default, Filter1x1Stride1Pad0 |

**Existing wavelet model fixes:**
- `BlockSize` corrected from `math::max(TileLoad, TileMath)` to
`TileMathThreadGroupSize` in the flat-GEMM wavelet device op and
gridwise kernel
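
As a purely arithmetic illustration of why the two `BlockSize` definitions can disagree, the sketch below compares them for a symmetric and an asymmetric split. The helper names are hypothetical, and the 256/128 split is borrowed from the conv (4,2) ratio; the PR text does not state which splits the flat-GEMM wavelet op actually uses.

```cpp
#include <cassert>

// Hypothetical helpers, for illustration only.

// Previous definition per the PR text: math::max(TileLoad, TileMath).
// A ternary stands in for math::max here.
constexpr int old_block_size(int tile_load, int tile_math)
{
    return tile_load > tile_math ? tile_load : tile_math;
}

// Corrected definition per the PR text: TileMathThreadGroupSize.
constexpr int new_block_size(int /*tile_load*/, int tile_math)
{
    return tile_math;
}

// Symmetric split: both definitions agree, so the difference is invisible.
static_assert(old_block_size(256, 256) == new_block_size(256, 256),
              "definitions agree when load and math groups are equal");

// Asymmetric split with fewer math threads: the old definition overstates
// the number of threads that take part in MFMA.
static_assert(old_block_size(256, 128) == 256, "old definition yields 256");
static_assert(new_block_size(256, 128) == 128, "new definition yields 128");
```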

## Test Plan

- `test_grouped_convnd_bwd_weight` GTest: 34 hardcoded test cases
covering 1D/2D/3D, F16/BF16, G=1/2/16, various spatial sizes
- Performance benchmark: all 37 RetinaNet bwd_weight shapes on gfx950

```bash
ninja -C build test_grouped_convnd_bwd_weight
./build/bin/test_grouped_convnd_bwd_weight
```

## Test Result

**Correctness:** 34/34 GTest cases passed (F16/BF16 × 1D/2D/3D ×
Default/Filter1x1Stride1Pad0 × various G/N/K/C combinations).

**Performance:** Wavelet is the fastest overall instance on 12/37
RetinaNet shapes — all G=1, 3×3 convolutions with C=256 (the VALU-heavy
target shapes):

| Shape | Uplift vs best baseline |
|-------|------------------------|
| K=36, 7×7 | 1.91x |
| K=36, 100×100 | 1.60x |
| K=36, 13×13 | 1.43x |
| K=36, 25×25 | 1.38x |
| K=36, 50×50 | 1.38x |
| K=256, 100×100 | 1.24x |
| K=256, 13×13, s=2 | 1.20x |
| K=256, 25×25, s=2 | 1.20x |
| K=256, 7×7 | 1.17x |
| K=256, 13×13 | 1.13x |
| K=2376, 100×100 | 1.06x |
| K=2376, 50×50 | 1.05x |

Where wavelet does not win (25/37): 1×1 convolutions (where the explicit
kernel does a host-side transform), grouped convolutions with small
per-group channel counts, and shapes where standard CShuffleV3 already
amortizes VALU overhead.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: jakpiase <jakpia21@gmail.com>

2026-05-18 17:46:01 +02:00

impl

[rocm-libraries] ROCm/rocm-libraries#5652 (commit 7dc7d1d)

2026-05-18 17:46:01 +02:00

conv_tensor_rearrange_op.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

convolution_backward_data_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

convolution_backward_weight_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

convolution_forward_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_avgpool_bwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_base.hpp

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

device_batched_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_e_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_multi_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_multiple_d_gemm_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_softmax_gemm_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_softmax_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batchnorm_backward.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batchnorm_forward.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batchnorm_infer.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_cgemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_contraction_multiple_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_bwd_data.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_fwd_bias_activation_add.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_fwd_bias_activation.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_tensor_rearrange.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_elementwise_normalization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_elementwise_scale.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_elementwise.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_bias_e_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_dequantB.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_d_ab_scale.hpp

Wmma support for gemm_ab_scale (#3314)

2025-12-11 09:06:20 +01:00

device_gemm_multiple_d_layernorm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_d_multiple_r.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_mx.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_reduce.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_splitk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_streamk_v2.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_streamk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_v2.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm.hpp

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

device_grouped_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_bwd_data_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_bwd_weight_multiple_d.hpp

[Conv] Enable bwd weight splitk autodeduction with cap (#3656)

2026-01-29 17:40:28 +00:00

device_grouped_conv_bwd_weight.hpp

[Conv] Enable bwd weight splitk autodeduction with cap (#3656)

2026-01-29 17:40:28 +00:00

device_grouped_conv_fwd_multiple_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_fwd_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_fixed_nk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_multi_abd_fixed_nk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_multi_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_softmax_gemm_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_splitk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_tile_loop.hpp

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

device_grouped_gemm.hpp

[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70)

2026-05-14 12:51:08 -07:00

device_max_pool_bwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_multiple_reduce.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_normalization_bwd_data.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_normalization_bwd_gamma_beta.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_normalization_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_pool_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_put_element.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_reduce_multi_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_reduce.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_softmax.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_splitk_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

gemm_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

helper.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

masking_specialization.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

matrix_padder.hpp

[rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)

2026-02-28 12:10:11 -08:00

reduction_operator_mapping.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

tensor_layout.hpp

[Compiler] Addressing new compiler warnings (#3640)

2026-02-02 09:39:48 -08:00

tensor_specialization.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

welford_helper.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00