[CK Tile] Add WAVELET pipeline for forward grouped
convolution (#8009)
## Motivation
CK Tile forward grouped convolution trails classic CK on 3x3
convolutions whose
output-channel count is not divisible by 8, where the narrow output
store limits
the compute CShuffle epilogue. This ports the WAVELET pipeline (added
for
backward-weight in #7937) to the forward kernel to close that gap.
## Technical Details
- Kernel (`grouped_convolution_forward_kernel.hpp`): WAVELET
load/math-wave wiring,
mirroring the backward-weight implementation; the non-WAVELET path is
unchanged.
- Generator: implement `parse_native_fwd_instance`, the forward
native-instance parser.
- Registered WAVELET instances: profiler bf16 3 / fp16 5, tests 1 each.
WAVELET requires input channels divisible by 8 (it does not apply to
depthwise).
The bf16/fp16 instance asymmetry is intentional and measured: the VecC=8
tiles
never beat the compute pool in bf16 but win about 20% of divisible-by-8
3x3 shapes
in fp16, so VecC=8 is registered for fp16 only.
## Test Plan
- Correctness (CPU reference) for every registered profiler instance,
across VecC variants.
- Per-shape best-instance performance sweep over the 34 RetinaNet shapes
(bf16) and
a 200-shape cross-model sweep (bf16 and fp16), compared against classic
CK.
## Test Result
- Correctness: PASS for all instances.
- RetinaNet (bf16, vs classic CK): faster on 28 of 34 shapes, geomean
+19.5%; the
not-divisible-by-8 shapes up to 3.7x. One 1x1 stride-2 shape stays ~20%
behind
classic CK, unrelated to WAVELET.
- Cross-model (200 shapes): WAVELET wins 3x3 not-divisible-by-8 in both
dtypes
(up to 61% over the next-best compute instance); for divisible-by-8 3x3
it wins
about 20% of shapes in fp16 (3-11%) and none in bf16.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
[CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32) (#6838)
## Motivation
CK currently has no kernel optimized for depthwise convolution
(G=C_in=C_out, C=K=1 per group) and existing generic paths perform
poorly for this workload. This PR adds a dedicated depthwise conv
forward kernel in CK Tile.
## Technical Details
Adds a dedicated depthwise conv2d forward op to CK Tile that performs
direct convolution rather than falling back to the generic GEMM path.
The kernel is templatized by filter size, stride, and data type, and
compiled into ~60 instances covering common configurations (kernel
3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and
RDNA (gfx1100/gfx1200) architectures.
## Test Plan
- [x] Correctness and performance validated on gfx942, gfx950, and
gfx1100, with ckProfiler `grouped_conv_fwd` as baseline.
- [ ] MI300A (gfx942) and gfx1200 validation.
## Submission Checklist
- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1137
---------
Co-authored-by: GenDu <Gen.Du@amd.com>
[CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114)
## Motivation
Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details
- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-786
[CK_TILE] Add CK Tile bwd weight profiler (#4797)
## Motivation
To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.
## Technical Details
- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.
---------
Co-authored-by: Ville Pietilä <>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>