mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-10 16:28:38 +00:00
[CK] Enable grouped conv bwd data to match non-grouped perf via NoShuffle + packed descriptors (#6208) ## Motivation Improve performance of grouped convolution backward-data kernels to match non-grouped kernel performance for G=1 cases. ## Technical Details - Add NoShuffle epilogue path (direct VGPR→Global writes) by setting `CDEBlockTransferScalarPerVector_NPerBlock = 1` - Add nongrouped-match instances with optimized BBlockTransfer parameters for better thread utilization - Add packed (flat) descriptor path for G=1 2D convolutions, using simpler tensor descriptors with fewer transform layers to reduce address computation overhead in the GEMM main loop - Cherry-pick PR #6090 for fair benchmarking (cache flush, include dX zeroing cost) ## Test Plan - Benchmark grouped vs non-grouped kernels on MI300X (589 shapes, BF16) - Verify correctness with existing conv bwd data tests ## Test Result | Metric | Before | After | |--------|--------|-------| | Mean ratio (grouped/nongrouped) | 1.159 | **1.028** | | Median ratio | 1.142 | **1.026** | | Cases within 2% | 26 (4.4%) | **186 (31.8%)** | | Cases >20% slower | 188 (32%) | **2 (0.3%)** | NoShuffle + nongrouped-match instances achieve **~2.8% average gap** with non-grouped kernels (down from ~16%). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: root <root@ctr-cx64-mi300x-4.amd.com> Co-authored-by: root <root@ctr-cx71-mi300x-01.amd.com> Co-authored-by: root <root@ctr-cx63-mi300x-21.amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: root <root@gt-ccs-aus-h17-18.cs-aus.dcgpu> Co-authored-by: Cursor <cursoragent@cursor.com>