composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 16:59:10 +00:00

Author	SHA1	Message	Date
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings from the latest staging compiler are not recognized by older compiler versions and trigger an avalanche of warnings. A previous attempt to suppress them with the -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that the current approach, checking the clang version, actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
JH-Leon-KIM-AMD	00e1d82ae7	[rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9) [CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732) ## Motivation Fix silent miscompute in the grouped convolution backward-data kernel (`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride > dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast path that dropped all but the first sub-GEMM, producing zeroed slices of `dx` on the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness without giving up the perf gains PR #6208 delivered on stride=1 shapes. ## Technical Details - Tighten the flat-descriptor fast-path gate to require `arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its original purpose). For stride > 1, the implicit GEMM is split into `gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly; routing them through the flat path required dropping all but the first, which was the source of the bug. - Stride > 1 now falls through to the existing grouped CShuffle path, which packs all sub-GEMMs into one descriptor array and walks them on-device in a single kernel launch. This is the pre-PR-6208 production path; correctness is established and per-dispatch launch count is minimised. - Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0) intersection in `test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with `gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this intersection (all stride>1 cases used G=2; all G=1 cases used stride=1), which is why PR #6208's regression slipped past CI. ## Test Plan - `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke tier — runs on every PR via `smart_build_and_test.sh`). 
- End-to-end verify (`verify=1`) via `example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes including the original ALMIOPEN-1959 case and a cross-bucket (`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32` buckets. - ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate via an environment variable: full kernel-family enumeration, winning kernel + its avg_time reported under each gate. 33/41 shapes completed before the sweep was stopped; the remaining 8 were the largest i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape enumeration budget (not relevant to the verdict). ## Test Result ### Correctness \| Test \| Result \| \|---\|:---:\| \| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D, includes 3 new regression shapes) \| 12/12 PASSED in 14.18 s \| \| `test_grouped_convnd_bwd_data_interface` (API checks) \| PASSED in 0.28 s \| \| ALMIOPEN-1959 stride=2 (`verify=1`) \| PASSED \| \| stride=1 K3 (`verify=1`) \| PASSED \| \| stride=3 K3 `gemms_count=9` (`verify=1`) \| PASSED \| \| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) \| PASSED \| ### Performance (ckProfiler A/B on gfx942 / MI300X) Comparing the post-fix gate (flat path only when `gemms_count_==1`, column "B") vs the inner-loop variant that keeps the flat path on stride>1 (column "A") across 25 stride>1 shapes where production picks a `_v1` instance (so the gate actually fires): \| Stride \| Shapes \| A wins \| Tie \| B wins \| Notes \| \|:------:\|:------:\|:------:\|:---:\|:------:\|---\| \| 1 (sanity, gate moot) \| 3 \| 0 \| 3 \| 0 \| gate doesn't differentiate — A == B as expected \| \| > 1 (gate fires) \| 25 \| 0 \| 11 \| 14 \| B wins +6% to +32%; A never wins \| Highlights from the firing-gate cases: \| Shape (G=1, stride=2 unless noted) \| A ms \| B ms \| B vs A \| \|---\|---:\|---:\|---:\| \| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) \| 0.183 \| 0.171 \| B +6% \| \| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) \| 0.054 \| 
0.045 \| B +17% \| \| i2v-010 (N=1, K=C=384, 3×3, 277×209) \| 0.174 \| 0.125 \| B +28% \| \| Synthetic 50×50 K3 N=32 K=C=256 \| 0.131 \| 0.088 \| B +32% \| Why B wins everywhere the gate fires: for `gemms_count = N`, the flat path needs N kernel launches (one per sub-GEMM), while the grouped path loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) × launch-tax is a structural disadvantage A can't recover from. ### Diff \| File \| Lines \| \|---\|---:\| \| `include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp` \| +14 / −8 (one extra condition + expanded dispatch comment) \| \| `test/.../test_grouped_convnd_bwd_data.cpp` \| +9 / −0 (3 new shapes) \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 09:59:14 +03:00
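The launch-count argument in #7732 can be sketched in a few lines of C++. This is an illustration only, not CK code: `SubGemmCount2D`, `Buckets`, and `ExtraFlatLaunches` are hypothetical helpers. The formula (one sub-GEMM per stride offset in each spatial dimension) is an assumption that holds only for dilation 1 and a filter at least as large as the stride. It happens to reproduce the `gemms_count` values 4, 9, and 36 quoted in the message. `kMaxGroupedGemmGroupsNum` mirrors the `MaxGroupedGemmGroupsNum=32` bucket size from the test plan.

```cpp
#include <cassert>

// Hypothetical helper, not part of CK. Assumes dilation == 1 and a filter at
// least as large as the stride, so every stride offset yields one sub-GEMM.
constexpr int SubGemmCount2D(int stride_h, int stride_w) { return stride_h * stride_w; }

// Bucket size quoted in the test plan (MaxGroupedGemmGroupsNum=32).
constexpr int kMaxGroupedGemmGroupsNum = 32;

// Number of buckets needed to hold all sub-GEMMs (ceiling division).
constexpr int Buckets(int gemms_count)
{
    return (gemms_count + kMaxGroupedGemmGroupsNum - 1) / kMaxGroupedGemmGroupsNum;
}

// Extra launches the flat path pays when it issues one launch per sub-GEMM,
// compared with a single launch: the "(N-1) x launch-tax" from the message.
constexpr int ExtraFlatLaunches(int gemms_count) { return gemms_count - 1; }

// Checks the sketch against the cases named in the commit message.
inline void CheckQuotedCases()
{
    assert(SubGemmCount2D(2, 2) == 4);  // stride=2
    assert(SubGemmCount2D(3, 3) == 9);  // stride=3, K3
    assert(SubGemmCount2D(6, 6) == 36); // stride=6, K6
    assert(Buckets(36) == 2);           // the cross-bucket case
    assert(ExtraFlatLaunches(36) == 35);
}
```

Under these assumptions, the stride=6 case pays 35 additional launches on the flat path, which is consistent with the reported result that the grouped path never loses where the gate fires.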
Bartłomiej Kocot	ebb97044f4	[rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1) Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664) ## Motivation Incorrect results have been introduced for some conv bwd cases. ## Technical Details This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3. ## Test Plan CI ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ALMIOPEN-1959	2026-05-22 12:28:49 +00:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. To pass all the checks, I also had to clean up a lot of non-ASCII characters in the source code comments and modify a couple of tests that were affected by new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
Johannes Graner	3727d5220a	[rocm-libraries] ROCm/rocm-libraries#5652 (commit 7dc7d1d) [CK Conv] Wavelet gemm pipeline for bwd_weight convolution (#5652) ## Motivation In the current CShuffleV3 backward weight kernel, the in-kernel conv-to-GEMM transform generates significant INT32 VALU pressure per MFMA instruction. On VALU-heavy shapes (e.g., G=1, 3×3, C=256), these index computation ops compete with MFMA for VALU issue slots, creating a bottleneck that cannot be resolved by pipeline prefetching alone. This PR adds a wave-specialized ("wavelet") convolution backward weight kernel that splits workgroup threads into two roles: - Load waves: conv-to-GEMM address computation + global memory loads + LDS writes (all VALU/VMEM) - Math waves: LDS reads + MFMA + CShuffle epilogue (no index computation) By physically separating the two instruction classes onto different waves, VALU and MFMA execute on different hardware functional units without contention. ## Technical Details Core kernel (new files): - `gridwise_gemm_xdl_waveletmodel_cshuffle_conv_v3.hpp` — wave-specialized gridwise GEMM for conv bwd weight (2-way split: load + math) - `device_grouped_conv_bwd_weight_xdl_waveletmodel_cshuffle_v3.hpp` — device op following CShuffleV3 patterns; `BlockSize = TileMathThreadGroupSize` for MFMA wave assignment, `LaunchBlockSize = TileLoad + TileMath` for kernel launch Wave pipeline (modified): - `gridwise_gemm_waveletmodel.hpp` — load/math wave pipeline structs with `sched_group_barrier` scheduling hints to front-load VMEM reads before address-advance VALU Two wave ratios: - (4,4): 256 load + 256 math = 512 threads (8 waves). Best on large shapes. - (4,2): 256 load + 128 math = 384 threads (6 waves). Best on small shapes (fewer sync barriers, denser MFMA per math wave). 
Instance coverage (F16 and BF16 symmetric): \| Ratio \| Tiles \| Layouts \| ConvSpecs \| \|-------\|-------\|---------\|-----------\| \| (4,4) \| M128×N128, M64×N64, M128×N64, M64×N128 \| 2D NHWGC, 3D NDHWGC \| Default, Filter1x1Stride1Pad0 \| \| (4,2) \| M64×N64, M128×N64, M64×N128 \| 2D NHWGC \| Default, Filter1x1Stride1Pad0 \| Existing wavelet model fixes: - `BlockSize` corrected from `math::max(TileLoad, TileMath)` to `TileMathThreadGroupSize` in the flat-GEMM wavelet device op and gridwise kernel ## Test Plan - `test_grouped_convnd_bwd_weight` GTest: 34 hardcoded test cases covering 1D/2D/3D, F16/BF16, G=1/2/16, various spatial sizes - Performance benchmark: all 37 RetinaNet bwd_weight shapes on gfx950 ```bash ninja -C build test_grouped_convnd_bwd_weight ./build/bin/test_grouped_convnd_bwd_weight ``` ## Test Result Correctness: 34/34 GTest cases passed (F16/BF16 × 1D/2D/3D × Default/Filter1x1Stride1Pad0 × various G/N/K/C combinations). Performance: Wavelet is the fastest overall instance on 12/37 RetinaNet shapes — all G=1, 3×3 convolutions with C=256 (the VALU-heavy target shapes): \| Shape \| Uplift vs best baseline \| \|-------\|------------------------\| \| K=36, 7×7 \| 1.91x \| \| K=36, 100×100 \| 1.60x \| \| K=36, 13×13 \| 1.43x \| \| K=36, 25×25 \| 1.38x \| \| K=36, 50×50 \| 1.38x \| \| K=256, 100×100 \| 1.24x \| \| K=256, 13×13, s=2 \| 1.20x \| \| K=256, 25×25, s=2 \| 1.20x \| \| K=256, 7×7 \| 1.17x \| \| K=256, 13×13 \| 1.13x \| \| K=2376, 50×50 \| 1.05x \| \| K=2376, 100×100 \| 1.06x \| Where wavelet does not win (25/37): 1×1 convolutions (explicit kernel does host-side transform), grouped convolutions with small per-group channels, and shapes where standard CShuffleV3 already amortizes VALU overhead. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com>	2026-05-18 17:46:01 +02:00
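The thread arithmetic behind the two wave ratios in #5652 can be written out as a small C++ sketch. `WaveletConfig` is a hypothetical type used only for illustration, not the CK device op. The 64-thread wavefront size is an assumption, chosen because it is consistent with the "512 threads (8 waves)" and "384 threads (6 waves)" figures in the message. The two member functions follow the relations stated there: `BlockSize = TileMathThreadGroupSize` and `LaunchBlockSize = TileLoad + TileMath`.

```cpp
#include <cassert>

// Assumption: 64 threads per wavefront, consistent with "512 threads (8 waves)".
constexpr int kWaveSize = 64;

// Hypothetical config type for illustration; not the CK device op.
struct WaveletConfig
{
    int tile_load_threads; // threads in the load waves
    int tile_math_threads; // threads in the math waves

    // BlockSize = TileMathThreadGroupSize: only math waves take part in MFMA.
    constexpr int BlockSize() const { return tile_math_threads; }

    // LaunchBlockSize = TileLoad + TileMath: what the kernel is launched with.
    constexpr int LaunchBlockSize() const { return tile_load_threads + tile_math_threads; }

    // Total wavefronts per workgroup under the wave-size assumption above.
    constexpr int Waves() const { return LaunchBlockSize() / kWaveSize; }
};

constexpr WaveletConfig kRatio44{256, 256}; // (4,4): 4 load waves + 4 math waves
constexpr WaveletConfig kRatio42{256, 128}; // (4,2): 4 load waves + 2 math waves

// Checks the sketch against the thread and wave counts in the message.
inline void CheckWaveCounts()
{
    assert(kRatio44.LaunchBlockSize() == 512 && kRatio44.Waves() == 8);
    assert(kRatio42.LaunchBlockSize() == 384 && kRatio42.Waves() == 6);
    assert(kRatio42.BlockSize() == 128);
}
```

The point of keeping `BlockSize` and `LaunchBlockSize` separate is that MFMA wave assignment depends only on the math waves, while the launch must also account for the load waves.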
JH-Leon-KIM-AMD	9a5d1ea791	[rocm-libraries] ROCm/rocm-libraries#6208 (commit 33424f6) [CK] Enable grouped conv bwd data to match non-grouped perf via NoShuffle + packed descriptors (#6208) ## Motivation Improve performance of grouped convolution backward-data kernels to match non-grouped kernel performance for G=1 cases. ## Technical Details - Add NoShuffle epilogue path (direct VGPR→Global writes) by setting `CDEBlockTransferScalarPerVector_NPerBlock = 1` - Add nongrouped-match instances with optimized BBlockTransfer parameters for better thread utilization - Add packed (flat) descriptor path for G=1 2D convolutions, using simpler tensor descriptors with fewer transform layers to reduce address computation overhead in the GEMM main loop - Cherry-pick PR #6090 for fair benchmarking (cache flush, include dX zeroing cost) ## Test Plan - Benchmark grouped vs non-grouped kernels on MI300X (589 shapes, BF16) - Verify correctness with existing conv bwd data tests ## Test Result \| Metric \| Before \| After \| \|--------\|--------\|-------\| \| Mean ratio (grouped/nongrouped) \| 1.159 \| 1.028 \| \| Median ratio \| 1.142 \| 1.026 \| \| Cases within 2% \| 26 (4.4%) \| 186 (31.8%) \| \| Cases >20% slower \| 188 (32%) \| 2 (0.3%) \| NoShuffle + nongrouped-match instances achieve ~2.8% average gap with non-grouped kernels (down from ~16%). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: root <root@ctr-cx64-mi300x-4.amd.com> Co-authored-by: root <root@ctr-cx71-mi300x-01.amd.com> Co-authored-by: root <root@ctr-cx63-mi300x-21.amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: root <root@gt-ccs-aus-h17-18.cs-aus.dcgpu> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 06:49:50 -07:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Illia Silin	22b9feb40f	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings (#7111) ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 07:14:14 -07:00
Illia Silin	b5d0cff36f	[rocm-libraries] ROCm/rocm-libraries#6933 (commit ac8b7d9) [CK] Filter out unsupported targets. (#6933) ## Motivation Filter out any unsupported targets, e.g., gfx900, gfx906, gfx90c, from the GPU_TARGETS or GPU_ARCHS lists. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 03:23:08 +00:00
jakpiase	fc39a02cda	[rocm-libraries] ROCm/rocm-libraries#6624 (commit 47d0162) [CK_TILE] Grouped Convolution Backward Data Direct Load (#6624) ## Proposed changes Add Grouped Convolution Backward Data with Direct Load into DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation. This enables direct global memory loading (bypassing LDS) for the backward data convolution path on gfx950, following the same pattern used in both backward weight and forward convolution. Direct load convolution backward data improves performance by avoiding LDS round-trips for certain configurations on gfx950, which supports a wider range of instructions. Currently correctness is checked only at usage point, but should be extended to a standalone UT in the future.	2026-04-23 11:16:55 +02:00
Illia Silin	d16061f578	[rocm-libraries] ROCm/rocm-libraries#6550 (commit c396de9) [CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. This PR disables them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
Ville Pietilä	c7fe8b72c6	[rocm-libraries] ROCm/rocm-libraries#6421 (commit 05b0753) [MIOpen][CK] Fix bwd weight conv test failures by disabling one block-GEMM V5 instance for 3D convs (#6421) ## Motivation Due to a compiler version update, there are test failures in the test target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current develop branch, the test failures are blocking any PR merges into develop. An example of a failed CI run is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342, as the tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. The first attempt to fix this problem had to be reverted in #6400 because it broke MIOpen internal DB sync tests. ## Technical Details The root cause of the test failures is the block-GEMM V5 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have a large tile size. The V5 pipeline uses a double register buffer that, in combination with the large tile size, causes high register pressure. The latest compiler version handles the register spilling incorrectly for `gfx90a`, which causes the kernel to output incorrect results. The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that do not use direct load are divided into two groups - Base instances - Instances that result in high register usage (currently only one instance, the one that causes the test failures). This division allows disabling only the V5 block-GEMM flavor of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. 
The selective disabling leaves the set of instances for 1D and 2D convolutions unaffected, and removes at runtime two V5 block-GEMM instances (`ConvBwdWeightDefault` and `ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the device is `gfx90a`. Because MIOpen uses CK's type string (provided by method `GetTypeString`) to identify the instances, the DB sync tests are expected to be unaffected, since there are still the V2 block-GEMM instances that result in the same type string (`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by running the MIOpen DB sync tests that are not part of the normal CK PR build. ## Test Plan Running all CI tests + the MIOpen internal DB sync tests is sufficient to verify the correctness of the code changes. ## Test Result Verified locally that the previously failing tests `TestGroupedConvndBwdWeight3d/4.Test3D` and `TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts - 231 on `gfx90a` - 233 on `gfx942` and are currently passing. This confirms the expectation that two instances per data type should be disabled on `gfx90a`. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Ville Pietilä <>	2026-04-17 09:16:32 +03:00
Brock Hargreaves	a4e36c1b89	[rocm-libraries] ROCm/rocm-libraries#6400 (commit c0b3c95) [MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400) Reverts ROCm/rocm-libraries#6343 This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.	2026-04-13 20:46:07 -06:00
Ville Pietilä	0f2279920b	[rocm-libraries] ROCm/rocm-libraries#6343 (commit 3604475) [CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343) ## Motivation Due to a compiler version update, there are test failures in the test suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current `develop` branch, the test failures are blocking any PR merges into `develop`. An example of a failed CI run is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342, as tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. ## Technical Details This PR disables the compilation of the problematic bwd weight conv instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that allows us to detect when we are compiling for `gfx90a`. Using the new CMake flag, compilation of instance `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`. Co-authored-by: Ville Pietilä <>	2026-04-13 13:40:27 +02:00
Aviral Goel	81a5826132	[rocm-libraries] ROCm/rocm-libraries#6323 (commit a668483) CK: Extract shared boilerplate from 47 gemm_quant test files (#6323) Depends on #6303 ## Summary Extract shared test boilerplate (includes, type aliases, test fixture macros) from 47 `test_gemm_quant_` files into a single `test_gemm_quant_common.hpp` header. Each test file is reduced from ~50 lines of boilerplate to ~5 lines. \| Metric \| Value \| \|--------\|-------\| \| Files changed \| 48 \| \| Insertions \| +413 \| \| Deletions \| −1,106 \| \| Net lines removed* \| −693 \| ### What changed \| Before \| After \| \|--------\|-------\| \| 47 test files, each with ~50 lines of identical includes, type aliases, and fixture macros \| 1 shared header (`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include + params) \| ### Readability assessment A code realist review confirmed this change improves readability: the 47 test files had identical boilerplate obscuring the only meaningful content — the `GemmConfig` type alias and test dimensions. After the refactoring, each file's unique configuration is immediately visible, and adding a new test variant requires specifying only the varying parameters instead of copying 50 lines. ### Cumulative cleanup series stats \| PR \| Description \| Net lines \| \|----\|-------------\|-----------\| \| #6300 \| Remove 61 dead `#if 0` blocks \| −2,648 \| \| #6302 \| Remove 41 commented-out dead code blocks \| −2,861 \| \| #6303 \| Remove 4 orphaned files \| −3,886 \| \| This PR \| Extract gemm_quant test boilerplate \| −693 \| \| Total \| \| −10,088 \|	2026-04-11 06:00:26 -04:00
Aviral Goel	c2663ce9fd	[rocm-libraries] ROCm/rocm-libraries#6303 (commit 784c268) CK: Remove 4 orphaned files with verified replacements (~1,025 lines) (#6303) Depends on #6302 ## Summary Remove 4 orphaned files that have verified replacements already in the build. \| File \| Reason \| Replacement \| \|------\|--------\|-------------\| \| `test_gemm_pipeline_compiler.cpp` \| Refactored into 13 smaller tests \| `_compv3`, `_compv4`, `_mem`, `_persistent`, etc. \| \| `test_grouped_gemm_quant.cpp` \| Refactored into 5 smaller tests \| `_rowcol`, `_tensor`, `_aquant`, `_bquant`, etc. \| \| `..._f8_f8_f16_..._comp_default_instance.cpp` \| Superseded by split files \| `_part1.cpp` + `_part2.cpp` \| \| `..._f8_f8_f16_..._comp_kpadding_instance.cpp` \| Superseded by split files \| `_part1.cpp` + `_part2.cpp` \| Each deletion was verified: - Original file is NOT in any CMakeLists.txt - Replacement files ARE in CMakeLists.txt and actively compiled - Content is fully covered by the replacement files	2026-04-10 11:22:31 -04:00
Aviral Goel	c7eb33078c	[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 11:17:11 -04:00
Estevan Vedovelli	a1beb9aa3e	[rocm-libraries] ROCm/rocm-libraries#5675 (commit fbd7fa7) [CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used in addition to MIOPEN_REQ_LIBS_ONLY (#5675) ## Motivation When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and -DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly installed. This change is necessary to allow hipTensor to build with TheRock without the need to rebuild CK from source. ## Technical Details The solution consists of considering both HIPTENSOR_REQ_LIBS_ONLY and MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake, following the same approach used for the conv target (for MIOpen). ## Test Plan Manually test the build and installation with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` and with both `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files are installed. ## Test Result The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the targets contraction, reduction and other, while `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv, contraction, reduction and other. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-25 17:58:59 -06:00
JP-Fernando	c31cf4fb4b	[rocm-libraries] ROCm/rocm-libraries#4591 (commit d34e981) [CK] Add BF16^3 support to grouped conv bwd weight: bilinear and scale (#4591) ## Motivation Until now, XDL grouped conv bwd weight for bilinear and scale only supported bf16f32bf16. Therefore, bf16bf16bf16 support should be added. ## Technical Details Instances were added to the relevant files in the `library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_bwd_weight/` folder. In addition, `add()` functions were included in new files in the `library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/xdl/` and `library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_scale/xdl/` folders. The new .cpp files were also included in the `CMakeLists.txt` files of both folders. ## Test Plan Execute `grouped_convnd_bwd_weight` tests to check execution on different architectures. The tests for bilinear and scale already include the tuple `std::tuple<ck::half_t, ck::half_t, ck::half_t, ck::Number<3>>`, so in principle, there is nothing to modify in the tests themselves. ## Test Result `gfx1201`: Tests passed. `gfx1100`: Tests passed. `gfx90a`: Tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-03-11 13:05:44 +01:00
Christopher Millette	70360dc74f	[rocm-libraries] ROCm/rocm-libraries#5030 (commit 8e02a26) [CK] Replace tuple value construction with tuple_element_t type extraction [1A] (#5030) ## Summary ### Rationale CK's device operation instance registration uses `add_device_operation_instances` at ~1,850 call sites to register GPU kernel configurations. The existing implementation constructs `std::tuple` values just to extract their types via `decltype`, then copy-constructs each instance into `make_unique`. This is wasteful — only the types matter, not the values — and forces the compiler to instantiate the full `std::tuple` constructor and `std::get` machinery at every call site. ### What changed - Replace `remove_cvref_t<decltype(std::get<i>(tuple_obj))>` with `std::tuple_element_t<i.value, TupleType>`, which extracts the type directly without constructing any values - Replace copy-from-default `make_unique<T>(value)` with direct default construction `make_unique<T>()` — all CK device operation instances are stateless structs with configuration encoded in template parameters - Add `static_assert(std::is_default_constructible_v<NewOpInstance>)` to enforce this contract at compile time with a clear error message - Add Doxygen documentation for this high-traffic public API ### Value - Eliminates unnecessary template instantiation of `std::tuple` constructors and `std::get` across ~1,850 call sites - Establishes a cleaner, more intention-revealing pattern for type-only tuple usage - The `static_assert` prevents silent breakage if a non-default-constructible type is ever added - No runtime behavior change — zero risk ### Files changed (9) - `add_device_operation_instance.hpp`: Core pattern change - 3 example files, 3 reduce instance headers, 1 convolution header, 1 profiler header ## Test plan - [ ] Existing CI tests cover all ~1,850 call sites (GEMM, reduce, softmax, convolution) - [ ] `static_assert` provides compile-time validation stronger than runtime tests - [ ] No 
runtime behavior change — stateless struct default construction is identical to copy-from-default - [ ] Compatible with both `std::tuple` and `ck::type_list` containers 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-03-06 09:27:27 -07:00
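The type-only registration pattern described in #5030 can be illustrated with a minimal, self-contained C++17 sketch. None of the names below are CK's: `BaseOp`, `Instance`, `AddOne`, `AddAll`, and `AddInstances` are hypothetical stand-ins, and the real `add_device_operation_instances` signature is not reproduced here. The sketch only shows the idea named in the message: take each element type with `std::tuple_element_t`, enforce default constructibility with a `static_assert`, and build the instance with `make_unique<T>()` without ever constructing a tuple value.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a device-op base class and stateless instances
// whose configuration lives entirely in template parameters.
struct BaseOp
{
    virtual ~BaseOp() = default;
    virtual int Id() const = 0;
};

template <int N>
struct Instance : BaseOp
{
    int Id() const override { return N; }
};

// Registers one instance from its type alone.
template <typename NewOpInstance>
void AddOne(std::vector<std::unique_ptr<BaseOp>>& out)
{
    static_assert(std::is_default_constructible_v<NewOpInstance>,
                  "instances must be default constructible");
    out.push_back(std::make_unique<NewOpInstance>());
}

// Expands over the tuple indices; only element types are inspected.
template <typename TupleType, std::size_t... Is>
void AddAll(std::vector<std::unique_ptr<BaseOp>>& out, std::index_sequence<Is...>)
{
    (AddOne<std::tuple_element_t<Is, TupleType>>(out), ...);
}

// Entry point: the tuple is passed as a type, so no tuple object is built.
template <typename TupleType>
void AddInstances(std::vector<std::unique_ptr<BaseOp>>& out)
{
    AddAll<TupleType>(out, std::make_index_sequence<std::tuple_size_v<TupleType>>{});
}

// Small usage check: two instance types in, two registered objects out.
inline void CheckRegistration()
{
    std::vector<std::unique_ptr<BaseOp>> ops;
    AddInstances<std::tuple<Instance<1>, Instance<2>>>(ops);
    assert(ops.size() == 2);
    assert(ops[0]->Id() == 1 && ops[1]->Id() == 2);
}
```

In this sketch, adding a type without a default constructor to the tuple fails at compile time at the `static_assert`, which is the contract the message describes.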
John Afaganis	9f702434e5	[rocm-libraries] ROCm/rocm-libraries#5101 (commit d4754fa) [CK] Remove log spam for deprecated convolutions (#5101) ## Motivation The `CK_BUILD_DEPRECATED` flag guards legacy non-grouped convolution instances, but both branches of every guard emit a `#pragma` message on every build, adding noise without actionable information. According to some recent testing, these non-grouped instances can outperform their grouped replacements in certain configurations, so their continued availability behind the flag remains valuable. This change removes only the warning directives while preserving all guards and guarded code paths. ## Technical Details Removed all `#pragma` message lines referencing deprecated instances from 25 convolution instance source files spanning conv1d_bwd_data, conv2d_fwd, conv2d_bwd_data, conv2d_fwd_bias_relu, conv2d_fwd_bias_relu_add, conv3d_bwd_data, grouped_conv3d_fwd, grouped_conv3d_bwd_data, and grouped_conv3d_bwd_weight. The `#if CK_BUILD_DEPRECATED` / `#else` / `#endif` preprocessor guards and all guarded code remain unchanged. ## Test Plan No functional change. The CK_BUILD_DEPRECATED conditional logic is unmodified; only #pragma message directives were removed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 17:37:44 -07:00
Bartłomiej Kocot	a67aaa1b96	[rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2) [CK] Port non-grouped convolution instances to the grouped kernels (#4875) ## Motivation Port non-grouped convolution instances to the grouped kernels to deprecate the older non-grouped implementations. ## Technical Details Add the same instances as the non-grouped ones, but using the grouped kernel. ## Test Plan test_grouped_convnd_fwd ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-724	2026-02-28 01:24:30 +00:00
Yung-sheng Tu	743552b6fd	[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f) Implement device_grouped_gemm_fixed_nk_bias for RDNA4 (#4340) ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-26 00:28:09 +00:00
Zoltán Lakatos	cb60fdd58d	[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f) [CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425) ## Motivation Add support for grouped gemm multi ABD fixed NK. MR ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. As mentioned, this feature apparently had no tests for xdl, so our assumption is that either there is an implementation bug or these instances were not set up correctly. This could warrant a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-25 05:16:07 +00:00
assistant-librarian[bot]	263288a383	[rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49) 173 implement device grouped gemm fixed nk for rdna4 (#4299) ## Proposed changes This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk instance library using WMMA. The implementation is based on the existing DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level structure, but replaces the XDL kernel with a WMMA-based one. It uses the GridwiseGemm_wmma_cshuffle_v3 kernel. At this stage, the focus is functional correctness and compatibility, not performance tuning. ## Technical Details - Device struct for grouped gemm fixed NK - Example code for the WMMA version - Unit tests for both new wmma implementation and the reference XDL code (previously missing) - Generic ck profiler interface with the purpose of calling unit tests. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. 
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3668](https://github.com/ROCm/composable_kernel/pull/3668) 🧑‍💻 Originally authored by @bidlekm --------- Co-authored-by: Marton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: bidlekm <bidlekmarton@gmail.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-19 09:13:05 +01:00
Ville Pietilä	6b9df93342	[rocm-libraries] ROCm/rocm-libraries#4652 (commit 39a5a53) Revert "[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size." (#4652) PR ROCm/rocm-libraries#4275 contains CK fwd conv instances optimized for `gfx950` and they do not compile for other architectures such as `gfx940`. To ensure that the optimized instances are compiled only for `gfx950`, compile-time guard `#if defined(CK_USE_GFX950)` was used. This approach works correctly when we compile for a single architecture, but when we compile simultaneously for multiple architectures, flag `CK_USE_GFX950` is set for non-gfx950 archs as well. As a result, the multi-arch compilation fails. The problem doesn't appear in the ROCm libraries CI/CD pipeline since only one architecture is compiled at a time. Hence, the CI/CD passed for the original PR. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-18 17:02:13 -07:00
assistant-librarian[bot]	788d66025d	[rocm-libraries] ROCm/rocm-libraries#4275 (commit 2e07a39) [CK] Add new fwd conv fp16/bf16 instances optimized for unit group size. (#4275) ## Proposed changes Added new FP16/BF16 instances that are optimized for group size = 1. The new instance use the compute optimized block GEMM pipeline. \| CK prof command \| Baseline (TFLOPs) \| New V3 instances (TFLOPs) \| \|:-----\|:------:\|------:\| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1 \| 858.818 \| 962.293 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 256 256 3 3 100 100 1 1 1 1 1 1 1 1 \| 979.987 \| 1121.11 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 50 50 1 1 1 1 1 1 1 1 \| 945.951 \| 1091.66 \| --- 🔁 Imported from [ROCm/composable_kernel#3670](https://github.com/ROCm/composable_kernel/pull/3670) 🧑‍💻 Originally authored by @vpietila-amd --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-02-17 17:58:11 -07:00
Jan Patrick Lehr	9c2dd2941b	[rocm-libraries] ROCm/rocm-libraries#4419 (commit e241f8b) [CK] Work around staging compiler lifetime warning ## Motivation The staging compiler enables lifetime-safety warnings and we already worked around a few of them. This works around a few more instances that came up recently on gfx950 builds. The initial PR that resolved most issues: https://github.com/ROCm/composable_kernel/pull/3640 ## Technical Details This follows the pattern to locally ignore the newly added lifetime-safety warnings that were moved from experimental to production in upstream LLVM. As a result, CK turned them on and treats them as errors, which prevents the staging compiler from building CK. ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-12 22:12:57 +00:00
Bartłomiej Kocot	ea4942cd02	[rocm-libraries] ROCm/rocm-libraries#4506 (commit d9ccef7) Revert "[CK Conv] Add bwd weight instance for large-k shape" (#4506) Reverts ROCm/rocm-libraries#4266 due to CI failures. Should be investigated by @johannes-graner	2026-02-11 21:37:50 +00:00
Johannes Graner	40cec769ce	[rocm-libraries] ROCm/rocm-libraries#4266 (commit 1d8094d) [CK Conv] Add bwd weight instance for large-k shape ## Proposed changes This instance improves the shape used in `./bin/ckProfiler grouped_conv_bwd_weight 1 2 0 2 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1 all` from 10.3 ms to 6.6 ms. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered	2026-02-10 16:58:04 +00:00
Ville Pietilä	57d26db844	[rocm-libraries] ROCm/rocm-libraries#4273 (commit 591f504) [CK] Add fwd conv group merging to v3 conv instances ## Proposed changes Added conv group merging to the (universal) V3 fwd conv pipeline. The new instance improves fwd conv performance when the number of input/output channel per group is low. On MI300 (`gfx942`) we get \| CK prof command \| Baseline (TFLOPS) \| V3 group merging (TFLOPS) \| \|:-----\|:------:\|------:\| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 \| 3.86035 \| 8.36796 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 \| 10.1867 \| 13.4677 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 \| 11.7875 \| 16.3657 \|	2026-02-08 11:35:56 +00:00
Illia Silin	569640dc70	Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 )" (#3705 ) This reverts commit `301eb5cf08`.	2026-02-03 09:52:14 -08:00
Zoltán Lakatos	301eb5cf08	Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 ) * device struct implementation * added xdl grouped multi abd fixed nk testing * wmma implementation fixed * avoid unnecessary device mem allocation and code cleanups * cleanup instances definitions * wmma examples added * code cleanups * fix clang format * typo and compilation fixes related to reference gemm * fix compilation error due to std::remove_cvref_t * added missing hip_check_error includes * correction to example instances * review comments addressed * removed split-k from testing * code formatting --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-02 13:58:11 -08:00
Kiefer van Teutem	2377a62837	Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529 ) * Adding remaining flavors for grouped conv fwd As titled. Following variants are added: - grouped_conv2d_fwd_dynamic_op - grouped_conv3d_fwd_dynamic_op - grouped_conv3d_fwd_bilinear - grouped_conv3d_fwd_convscale - grouped_conv3d_fwd_convinvscale - grouped_conv3d_fwd_convscale_add - grouped_conv3d_fwd_convscale_relu - grouped_conv3d_fwd_scale - grouped_conv3d_fwd_combconvscale - grouped_conv3d_fwd_scaleadd_scaleadd_relu * Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3. * Do not build f8 / bf8 only flavor tests on RDNA3 * Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time. * Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream. * adding int8 and fp16 overloads to the elementwise operations * fixed copilot nits * Addressing review comments: - removed unnecessary examples for dynamic op - removed unnecessary conv specializations for all the flavors - removed spurious bilinear and scale source files * clang-format * reduced number of tests --------- Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>	2026-01-30 17:02:14 +01:00
Illia Silin	05ef93a69d	Add a flag to build CK libs required for HipTensor. (#3684 ) * create a filter to build only libs required by hiptensor * allow building libs for miopen and hiptensor at the same time * tweak the lib filtering logic one more time	2026-01-29 16:12:49 -08:00
Enrico Degregori	f16d9100e4	Multi AB support for wave transfer (#3578 ) * Add multi AB support to wave transfer * Improvements to multi ABD examples * Add instances and use intrawave v1 instead of interwave * Apply changes to other transfers * Wave transfer: add support for multiple internal vgpr buffers * Fix compilation error gfx11	2026-01-29 10:29:40 -08:00
Bartłomiej Kocot	83b58bb0c3	Grouped Conv Bwd Weight Direct Load (#3648 ) * Grouped Conv Bwd Weight Direct Load * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp * Implement group merging for bwd_weight and add instances * Link direct load instances * builder fixes * fix * fixes * fix --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com>	2026-01-28 15:31:54 -06:00
Johannes Graner	c190d8d61f	[CK tests] Extend conv GPU reference (#3539 ) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible	2026-01-27 09:49:42 +01:00
Enrico Degregori	2e49b6b2f7	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format	2026-01-26 12:57:09 -08:00
yinglu	8942a19d5e	ck: add CK_USE_GFX950 macro (#3636 )	2026-01-26 11:38:45 -08:00
Ville Pietilä	7ac3794284	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639 ) Co-authored-by: Ville Pietilä <>	2026-01-25 13:42:23 +01:00
Wojciech Laskowski	2e08a7e5ab	WMMA grouped conv fwd large tensor bias bnorm clamp (#3595 ) * Added bias_bnorm_clamp for WMMA conv fwd large tensor. Following operations are added for FP16/BF16 data type and NHWGCxGKYXC layout. - grouped_conv2d_fwd_bias_bnorm_clamp - grouped_conv3d_fwd_bias_bnorm_clamp * changed strategy to handle GemmArgs array * Adding generic instance * fixed last nits from reviewers and copilot	2026-01-23 12:20:00 +01:00
Wojciech Laskowski	81ee19bd2c	WMMA grouped conv fwd large tensor extra flavors (#3582 ) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accommodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops	2026-01-23 12:19:51 +01:00
Bartłomiej Kocot	7b3db1a878	Grouped conv fwd direct load vector=2 (#3632 )	2026-01-23 10:29:59 +01:00
ApoorvaKalyani	8daf6ea302	Grouped conv_fwd_bias_bnorm_clamp instances and tests (#3525 ) * Added bias_bnorm_clamp instances. * fwd_bias_bnorm_clamp comp instances * fwd_bias_bnorm_mem_inter and mem_intra instances * fwd_bias_bnorm_merged_group_instances * fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances * Device level changes for fwd_bias_bnorm_clamp * Added the test to the regression test list. * Removed the part 2 and 2x instances * Removed the irrelevant checks in wmma * Refactored the instances to adapt to new device implementation * Updated the reference and include files * enabling tests * Added missing profiler * Added missing instance entry , deleted by mistake * Reduce bias bnorm clamp instances to only a single generic one. * Clean up cmakelists file * clang-format * Change bias bnorm clamp tests to use monotone initialization values to avoid tiny off-integer gemm results on RDNA3 from blowing up. * Renaming some instance lists and add functions to be more standardized. * Commented out non default instances. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2026-01-22 09:53:59 +01:00
Erwin Terpstra	d5ae81b292	Implement batched gemm add relu gemm add for rdna4 (#3391 ) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multiple d * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input 
values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2026-01-20 13:06:59 -08:00
Estevan Vedovelli	7d8bca7ddc	Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598 ) * Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions Enables hipTensor to access the WMMA HW functionalities for these combinations of datatype on gfx11 and gfx12. * Fix change to contraction scale tests * Fix clang-format	2026-01-20 09:39:57 -08:00
Wojciech Laskowski	b09121f860	WMMA support for batched_gemm_reduce (#3332 ) Summary: - added new device impl of Batched GEMM Reduce for WMMA - added instance library - added WMMA impl to the Batched GEMM Reduce tests	2026-01-20 10:50:46 +01:00
Erwin Terpstra	fe40a5d139	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method	2026-01-17 08:30:27 +01:00
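
Note on commit 70360dc74f (type extraction via `tuple_element_t`): the message describes replacing value-based type extraction (`decltype(std::get<i>(tuple_obj))`) with `std::tuple_element_t` and default construction through `make_unique<T>()`. The following is a minimal sketch of that general pattern in standard C++17, not the code from `add_device_operation_instance.hpp`. `BaseOp`, `GemmOp`, `add_one`, `add_all`, and `add_device_operation_instances_sketch` are hypothetical names invented for illustration, and `std::index_sequence` stands in for whatever compile-time index CK actually uses.

```cpp
#include <cstddef>
#include <memory>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-ins for stateless device-operation instances whose
// configuration lives entirely in template parameters.
struct BaseOp
{
    virtual ~BaseOp()             = default;
    virtual int tile_size() const = 0;
};

template <int Tile>
struct GemmOp : BaseOp
{
    int tile_size() const override { return Tile; }
};

using Instances = std::tuple<GemmOp<64>, GemmOp<128>>;

// Value-based extraction: names the type by way of an expression on a tuple object.
using ViaGet = std::remove_cv_t<
    std::remove_reference_t<decltype(std::get<1>(std::declval<Instances&>()))>>;

// Type-only extraction: no tuple object or std::get call is involved.
using ViaElement = std::tuple_element_t<1, Instances>;

static_assert(std::is_same_v<ViaGet, ViaElement>, "both spellings name the same type");

// Register one instance by default-constructing it directly.
template <typename Op>
void add_one(std::vector<std::unique_ptr<BaseOp>>& out)
{
    static_assert(std::is_default_constructible_v<Op>,
                  "instances must be default-constructible");
    out.push_back(std::make_unique<Op>());
}

// Expand over every element type of the tuple with a fold expression.
template <typename TupleType, std::size_t... Is>
void add_all(std::vector<std::unique_ptr<BaseOp>>& out, std::index_sequence<Is...>)
{
    (add_one<std::tuple_element_t<Is, TupleType>>(out), ...);
}

// Entry point: the tuple is passed as a type only, never as a value.
template <typename TupleType>
void add_device_operation_instances_sketch(std::vector<std::unique_ptr<BaseOp>>& out)
{
    add_all<TupleType>(out, std::make_index_sequence<std::tuple_size_v<TupleType>>{});
}
```

Calling `add_device_operation_instances_sketch<Instances>(ops)` on an empty `std::vector<std::unique_ptr<BaseOp>> ops` appends one default-constructed object per tuple element, in tuple order, without ever constructing an `Instances` value.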


597 Commits