mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 00:39:02 +00:00

Johannes Graner b7c8fb164f [rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d)

[CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight instances (#7937)

## Motivation

CK Tile had no pipeline competitive with old CK's wavelet on the
RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a
wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive
kernel for spatial small-K shapes.

## Technical Details

- New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`):
workgroup split into math waves (LDS read + MFMA) and load waves (DRAM
read + LDS write).
- VGPR role-split: `operator()` has two top-level, mutually exclusive
`is_math` branches so the allocator overlays both roles onto the same
physical VGPRs, cutting arch VGPR usage by ~33-40% and raising occupancy.
Correctness depends on identical `block_sync_lds` counts on both arms
plus a matching load-wave barrier stub in the epilogue
(`cshuffle_epilogue.hpp`).
- Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`):
`kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub.
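
The barrier-parity requirement above can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the ck_tile implementation: `wave_trace`, `sync_stub`, `run_wave`, and `check_barrier_parity` are hypothetical names, the barrier is modeled as a plain counter, and the two-barriers-per-K-tile schedule is an assumed pattern chosen for illustration. It shows the two top-level `is_math` arms and why the load arm needs a no-work barrier stub in the epilogue.

```cpp
#include <cassert>

// Hypothetical host-side model of one wave's control flow.
// Nothing here is ck_tile API; a barrier is modeled as a counter.
struct wave_trace
{
    int barriers   = 0; // block-wide syncs this wave arrived at
    int lds_writes = 0; // tiles written to LDS (load role)
    int mfma_steps = 0; // tiles consumed by MFMA (math role)
};

// Stand-in for a block-wide LDS barrier: only counts arrivals.
inline void sync_stub(wave_trace& t) { ++t.barriers; }

// Two top-level, mutually exclusive arms selected by is_math.
// Assumed schedule: two barriers per K tile on each arm.
inline wave_trace run_wave(bool is_math, int num_k_tiles, int epilogue_syncs)
{
    wave_trace t;
    if(is_math)
    {
        for(int k = 0; k < num_k_tiles; ++k)
        {
            sync_stub(t);   // wait until the load waves have filled LDS
            ++t.mfma_steps; // LDS read + MFMA
            sync_stub(t);   // release the LDS tile for overwriting
        }
        for(int i = 0; i < epilogue_syncs; ++i)
            sync_stub(t);   // barriers issued by the real epilogue
    }
    else
    {
        for(int k = 0; k < num_k_tiles; ++k)
        {
            ++t.lds_writes; // DRAM read + LDS write
            sync_stub(t);   // publish the tile to the math waves
            sync_stub(t);   // wait until the math waves are done with it
        }
        for(int i = 0; i < epilogue_syncs; ++i)
            sync_stub(t);   // barrier stub: arrive, do no work
    }
    return t;
}

// Both arms must arrive at the same number of barriers, or the block hangs.
inline void check_barrier_parity(int num_k_tiles, int epilogue_syncs)
{
    assert(run_wave(true, num_k_tiles, epilogue_syncs).barriers ==
           run_wave(false, num_k_tiles, epilogue_syncs).barriers);
}
```

If the stub loop were dropped from the load arm of this model, the two counts would differ by `epilogue_syncs`, which corresponds to the deadlock case that the split-K sweep in the test plan checks for.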

Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36
C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27%
(googlenet K=320 by 16-23%); the role-split roughly halves the parity
gap vs old CK on the 13x13 fp16 shape.

## Test Plan

- `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16,
`split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a
broad 2D sweep.
- Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64}
(barrier-parity / deadlock check).
- `test_grouped_convnd_bwd_weight` run over the wavelet instances from
the tests' `.conf`.

## Test Result

- All wavelet instances CPU-verify correct across the split-K sweep; no
hangs (dual-arm barrier sequence matches).
- Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best
non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better
vs old CK on the majority of spatial shapes.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-06-02 08:51:17 +00:00

| Name | Last commit | Date |
|---|---|---|
| algorithm | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| arch | [rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d) | 2026-06-02 08:51:17 +00:00 |
| container | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| numeric | [rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93) | 2026-05-30 01:28:48 +00:00 |
| tensor | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| utility | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| config.hpp | [rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7) | 2026-05-30 00:10:26 +00:00 |
| README.md | introducing ck_tile! (#1216) | 2024-04-15 19:27:12 -05:00 |

README.md

ck_tile/core

ck_tile/core contains all the basic functions and structures needed to create a GPU kernel using ck_tile. Users should only need to include the single header ck_tile/core.hpp to use all of its functionality. Everything is under the ck_tile namespace. The coding style in this folder should be similar to std (snake_case for structures/functions, CamelCase for template types...).

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    basic device building blocks such as mma, buffer addressing, etc.
container/
    basic container data structures: array/sequence/tuple/...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and tile-level API
utility/
    other utility functions for both host and device
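
The naming convention described above (snake_case for structures and functions, CamelCase for template types) can be sketched as follows. This is an illustrative example only: the `demo` namespace, `static_buffer`, `reduce_sum`, and `usage_example` are hypothetical names that do not exist in ck_tile, and the snippet deliberately avoids including `ck_tile/core.hpp` so that it compiles on its own.

```cpp
#include <cassert>
#include <cstddef>

namespace demo { // hypothetical namespace, standing in for ck_tile

// snake_case structure name; CamelCase template parameters.
template <typename DataType, std::size_t Size>
struct static_buffer
{
    DataType data[Size];

    constexpr std::size_t size() const { return Size; }
    constexpr DataType& operator[](std::size_t i) { return data[i]; }
    constexpr const DataType& operator[](std::size_t i) const { return data[i]; }
};

// snake_case free function; CamelCase template parameters.
template <typename DataType, std::size_t Size>
constexpr DataType reduce_sum(const static_buffer<DataType, Size>& buf)
{
    DataType acc = DataType{};
    for(std::size_t i = 0; i < Size; ++i)
        acc += buf[i];
    return acc;
}

// Small usage example: build a buffer and sum its elements.
inline int usage_example()
{
    static_buffer<int, 4> buf{{1, 2, 3, 4}};
    assert(buf.size() == 4);
    return reduce_sum(buf);
}

} // namespace demo
```

In real code the equivalent building blocks would come from the `ck_tile` namespace through the single `ck_tile/core.hpp` include, rather than being defined locally as they are here.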