composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
Emily Martins	674f7cdc0e	[rocm-libraries] ROCm/rocm-libraries#8141 (commit d3defa6) [CK] Remove Stream-K from old CK ## Motivation Since Stream-K has a CK Tile implementation, we no longer need Stream-K in old CK. Hence, this PR removes Stream-K from old CK. ## Technical Details All Stream-K artifacts in old CK have been removed including examples, tests, kernels, and CK profiler artifacts. ## Test Plan Ran a CI run on the branch before publishing PR. ## Test Result All tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>	2026-06-08 16:47:26 +00:00
Bartłomiej Kocot	28f2966762	[rocm-libraries] ROCm/rocm-libraries#7734 (commit 03ffb9d) [CK] Grouped Convolution Global Load/Store instances ## Motivation Support global load and store in grouped convolutions using instance factory. ## Technical Details - add new instances for each direction - add new tests for large cases ## Test Plan New test for large cases ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1255	2026-06-06 22:52:59 +00:00
John Afaganis	96c39b331e	[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da) [ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility (#7829) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary CK source files must be compilable via hipRTC (HIP runtime compilation), whose preprocessor does not accept non-ASCII bytes anywhere in a translation unit — including in comments. Bytes that are harmless under `hipcc` (em-dashes, smart quotes, multiplication signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at preprocessing time. These regularly leak in via LLM-assisted authoring or copy/paste from formatted documents and silently break hipRTC paths that are not exercised by the default `hipcc`-based build matrix. This PR (a) cleans every existing violation (53 files) and (b) adds a pre-checkin gate so new violations are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). This is a strict superset of the existing `Clang Format` stage's predicate — `.inc` is added so test-fixture include files are also gated. The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage* in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `79798aa6261` — `[ck] Convert reflect/ rendering to ASCII for hipRTC compatibility` Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `\|- / +- / \| ` (3-col width preserved so alignment is unchanged). 
`conv_description.hpp` swaps `×` for `x` as the dimension separator. `test_conv_description.cpp` expected strings updated in lockstep so the snapshot test stays green. This is the only commit in the series with observable runtime impact. 2. `738fdb0d81c` — `[ck] Strip non-ASCII bytes from C++ sources for hipRTC compatibility` Mechanical text cleanup across 53 files. Replacements happen in comments or in `std::cout` strings that are not asserted on by any test. None of the 174 `.inc` files in the tree required edits, but they were in the scan's predicate so the enforcement stage's predicate is a superset of what was scanned. Full replacement table in the commit message. 3. `1d7cd8ba235` — `[ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility` - New `projects/composablekernel/script/check_ascii_only.sh` (modeled on `check_copyright_year.sh`). - New entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). - New `ASCII Only Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the existing `Clang Format` stage but with `.inc` added to the find predicate. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 50 known violations; commit 2 leaves 0; commit 3 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains non-ASCII bytes: 1:// em-dash test — here Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.) $ echo $? 1 ``` Full repo scan after the cleanup commits (note the `-name '.inc'` clause): ``` $ cd projects/composablekernel && find . 
-type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \ -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \ -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \ | xargs -0 -P 8 -n 64 script/check_ascii_only.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check` stage runs green over the full predicate (incl. `*.inc`) and existing `Clang Format` stage is unaffected. - [ ] `test_conv_description` passes against the ASCII tree-formatter output (touched in commit 1). - [ ] Local: `pre-commit run ascii-only-checker --all-files` runs cleanly after installing CK pre-commit hooks via `script/install_precommit.sh`. - [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. - [ ] Spot-check a representative subset of touched files under hipRTC compilation to confirm no remaining hipRTC-blocking content (optional, since the static byte check is a sufficient condition for hipRTC preprocessor acceptance on this dimension). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-04 15:00:17 +00:00
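The byte-level rule that the ASCII check above enforces can be sketched in C++ as follows. This is only an illustration of the predicate (any byte with value 0x80 or above is a violation); the actual gate is the shell script `check_ascii_only.sh` named in the PR, and the function name `first_non_ascii` is hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper, not part of CK: returns the byte offset of the first
// non-ASCII byte (value >= 0x80), or std::nullopt if all bytes are 7-bit ASCII.
constexpr std::optional<std::size_t> first_non_ascii(std::string_view text)
{
    for(std::size_t i = 0; i < text.size(); ++i)
    {
        // Cast to unsigned char so bytes above 0x7F compare correctly even
        // when plain char is signed.
        if(static_cast<unsigned char>(text[i]) >= 0x80)
        {
            return i;
        }
    }
    return std::nullopt;
}

// A plain ASCII comment passes the check.
static_assert(!first_non_ascii("// plain comment\n").has_value());
```

Applied to the demo input from the PR description, the UTF-8 em-dash (bytes `e2 80 94`) is reported at its first byte, since every byte of a multi-byte UTF-8 sequence is 0x80 or above.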
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
JH-Leon-KIM-AMD	00e1d82ae7	[rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9) [CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732) ## Motivation Fix silent miscompute in the grouped convolution backward-data kernel (`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride > dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast path that dropped all but the first sub-GEMM, producing zeroed slices of `dx` on the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness without giving up the perf gains PR #6208 delivered on stride=1 shapes. ## Technical Details - Tighten the flat-descriptor fast-path gate to require `arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its original purpose). For stride > 1, the implicit GEMM is split into `gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly; routing them through the flat path required dropping all but the first, which was the source of the bug. - Stride > 1 now falls through to the existing grouped CShuffle path, which packs all sub-GEMMs into one descriptor array and walks them on-device in a single kernel launch. This is the pre-PR-6208 production path; correctness is established and per-dispatch launch count is minimised. - Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0) intersection in `test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with `gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this intersection (all stride>1 cases used G=2; all G=1 cases used stride=1), which is why PR #6208's regression slipped past CI. ## Test Plan - `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke tier — runs on every PR via `smart_build_and_test.sh`). 
- End-to-end verify (`verify=1`) via `example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes including the original ALMIOPEN-1959 case and a cross-bucket (`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32` buckets. - ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate via an environment variable: full kernel-family enumeration, winning kernel + its avg_time reported under each gate. 33/41 shapes completed before the sweep was stopped; the remaining 8 were the largest i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape enumeration budget (not relevant to the verdict). ## Test Result ### Correctness \| Test \| Result \| \|---\|:---:\| \| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D, includes 3 new regression shapes) \| 12/12 PASSED in 14.18 s \| \| `test_grouped_convnd_bwd_data_interface` (API checks) \| PASSED in 0.28 s \| \| ALMIOPEN-1959 stride=2 (`verify=1`) \| PASSED \| \| stride=1 K3 (`verify=1`) \| PASSED \| \| stride=3 K3 `gemms_count=9` (`verify=1`) \| PASSED \| \| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) \| PASSED \| ### Performance (ckProfiler A/B on gfx942 / MI300X) Comparing the post-fix gate (flat path only when `gemms_count_==1`, column "B") vs the inner-loop variant that keeps the flat path on stride>1 (column "A") across 25 stride>1 shapes where production picks a `_v1` instance (so the gate actually fires): \| Stride \| Shapes \| A wins \| Tie \| B wins \| Notes \| \|:------:\|:------:\|:------:\|:---:\|:------:\|---\| \| 1 (sanity, gate moot) \| 3 \| 0 \| 3 \| 0 \| gate doesn't differentiate — A == B as expected \| \| > 1 (gate fires) \| 25 \| 0 \| 11 \| 14 \| B wins +6% to +32%; A never wins \| Highlights from the firing-gate cases: \| Shape (G=1, stride=2 unless noted) \| A ms \| B ms \| B vs A \| \|---\|---:\|---:\|---:\| \| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) \| 0.183 \| 0.171 \| B +6% \| \| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) \| 0.054 \| 
0.045 \| B +17% \| \| i2v-010 (N=1, K=C=384, 3×3, 277×209) \| 0.174 \| 0.125 \| B +28% \| \| Synthetic 50×50 K3 N=32 K=C=256 \| 0.131 \| 0.088 \| B +32% \| Why B wins everywhere the gate fires: for `gemms_count = N`, the flat path needs N kernel launches (one per sub-GEMM), while the grouped path loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) × launch-tax is a structural disadvantage A can't recover from. ### Diff \| File \| Lines \| \|---\|---:\| \| `include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp` \| +14 / −8 (one extra condition + expanded dispatch comment) \| \| `test/.../test_grouped_convnd_bwd_data.cpp` \| +9 / −0 (3 new shapes) \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 09:59:14 +03:00
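The tightened gate and the launch-count argument above can be sketched as follows. The struct and function names are hypothetical stand-ins, not the device op's real interface; only `gemms_count_` is a name taken from the PR text, and treating G=1, 2D, and NumDTensor=0 as part of the gate is an assumption based on the intersection the PR describes.

```cpp
// Hypothetical model of the argument fields the PR description mentions.
struct BwdDataArgSketch
{
    int num_groups;      // G
    int num_dim_spatial; // 2 for 2D convolutions
    int num_d_tensor;    // NumDTensor
    int gemms_count_;    // number of sub-GEMMs the implicit GEMM is split into
};

// Post-fix gate (sketch): the flat-descriptor fast path is taken only when
// there is exactly one sub-GEMM per dispatch.
constexpr bool use_flat_path(const BwdDataArgSketch& arg)
{
    return arg.num_groups == 1 && arg.num_dim_spatial == 2 &&
           arg.num_d_tensor == 0 && arg.gemms_count_ == 1;
}

// Launch counts as described in the PR: the flat path needs one launch per
// sub-GEMM, while the grouped path walks all sub-GEMMs in a single launch.
constexpr int flat_path_launches(const BwdDataArgSketch& arg) { return arg.gemms_count_; }
constexpr int grouped_path_launches(const BwdDataArgSketch&) { return 1; }

// stride=1 style case: one sub-GEMM, so the fast path is allowed.
static_assert(use_flat_path({1, 2, 0, 1}));
// gemms_count=4 regression case: falls through to the grouped path.
static_assert(!use_flat_path({1, 2, 0, 4}));
```

Under this model, a `gemms_count_ = 36` shape would cost 36 launches on the flat path versus 1 on the grouped path, which is the structural gap the performance table attributes to launch overhead.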
Bartłomiej Kocot	ebb97044f4	[rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1) Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664) ## Motivation Incorrect results have been introduced for some conv bwd cases. ## Technical Details This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3. ## Test Plan CI ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ALMIOPEN-1959	2026-05-22 12:28:49 +00:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
Johannes Graner	3727d5220a	[rocm-libraries] ROCm/rocm-libraries#5652 (commit 7dc7d1d) [CK Conv] Wavelet gemm pipeline for bwd_weight convolution (#5652) ## Motivation In the current CShuffleV3 backward weight kernel, the in-kernel conv-to-GEMM transform generates significant INT32 VALU pressure per MFMA instruction. On VALU-heavy shapes (e.g., G=1, 3×3, C=256), these index computation ops compete with MFMA for VALU issue slots, creating a bottleneck that cannot be resolved by pipeline prefetching alone. This PR adds a wave-specialized ("wavelet") convolution backward weight kernel that splits workgroup threads into two roles: - Load waves: conv-to-GEMM address computation + global memory loads + LDS writes (all VALU/VMEM) - Math waves: LDS reads + MFMA + CShuffle epilogue (no index computation) By physically separating the two instruction classes onto different waves, VALU and MFMA execute on different hardware functional units without contention. ## Technical Details Core kernel (new files): - `gridwise_gemm_xdl_waveletmodel_cshuffle_conv_v3.hpp` — wave-specialized gridwise GEMM for conv bwd weight (2-way split: load + math) - `device_grouped_conv_bwd_weight_xdl_waveletmodel_cshuffle_v3.hpp` — device op following CShuffleV3 patterns; `BlockSize = TileMathThreadGroupSize` for MFMA wave assignment, `LaunchBlockSize = TileLoad + TileMath` for kernel launch Wave pipeline (modified): - `gridwise_gemm_waveletmodel.hpp` — load/math wave pipeline structs with `sched_group_barrier` scheduling hints to front-load VMEM reads before address-advance VALU Two wave ratios: - (4,4): 256 load + 256 math = 512 threads (8 waves). Best on large shapes. - (4,2): 256 load + 128 math = 384 threads (6 waves). Best on small shapes (fewer sync barriers, denser MFMA per math wave). 
Instance coverage (F16 and BF16 symmetric): \| Ratio \| Tiles \| Layouts \| ConvSpecs \| \|-------\|-------\|---------\|-----------\| \| (4,4) \| M128×N128, M64×N64, M128×N64, M64×N128 \| 2D NHWGC, 3D NDHWGC \| Default, Filter1x1Stride1Pad0 \| \| (4,2) \| M64×N64, M128×N64, M64×N128 \| 2D NHWGC \| Default, Filter1x1Stride1Pad0 \| Existing wavelet model fixes: - `BlockSize` corrected from `math::max(TileLoad, TileMath)` to `TileMathThreadGroupSize` in the flat-GEMM wavelet device op and gridwise kernel ## Test Plan - `test_grouped_convnd_bwd_weight` GTest: 34 hardcoded test cases covering 1D/2D/3D, F16/BF16, G=1/2/16, various spatial sizes - Performance benchmark: all 37 RetinaNet bwd_weight shapes on gfx950 ```bash ninja -C build test_grouped_convnd_bwd_weight ./build/bin/test_grouped_convnd_bwd_weight ``` ## Test Result Correctness: 34/34 GTest cases passed (F16/BF16 × 1D/2D/3D × Default/Filter1x1Stride1Pad0 × various G/N/K/C combinations). Performance: Wavelet is the fastest overall instance on 12/37 RetinaNet shapes — all G=1, 3×3 convolutions with C=256 (the VALU-heavy target shapes): \| Shape \| Uplift vs best baseline \| \|-------\|------------------------\| \| K=36, 7×7 \| 1.91x \| \| K=36, 100×100 \| 1.60x \| \| K=36, 13×13 \| 1.43x \| \| K=36, 25×25 \| 1.38x \| \| K=36, 50×50 \| 1.38x \| \| K=256, 100×100 \| 1.24x \| \| K=256, 13×13, s=2 \| 1.20x \| \| K=256, 25×25, s=2 \| 1.20x \| \| K=256, 7×7 \| 1.17x \| \| K=256, 13×13 \| 1.13x \| \| K=2376, 50×50 \| 1.05x \| \| K=2376, 100×100 \| 1.06x \| Where wavelet does not win (25/37): 1×1 convolutions (explicit kernel does host-side transform), grouped convolutions with small per-group channels, and shapes where standard CShuffleV3 already amortizes VALU overhead. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com>	2026-05-18 17:46:01 +02:00
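The thread and wave arithmetic for the two wave ratios above can be checked with a small sketch. The struct is hypothetical; the relations `BlockSize = TileMathThreadGroupSize` and `LaunchBlockSize = TileLoad + TileMath` come from the PR text, and the wave size of 64 is an assumption that is consistent with the stated 512 threads = 8 waves.

```cpp
// Assumption: 64 threads per wave (matches "512 threads (8 waves)" in the PR).
constexpr int kWaveSize = 64;

// Hypothetical configuration holder for one load/math wave split.
struct WaveletConfigSketch
{
    int tile_load_threads; // threads doing address computation + global loads
    int tile_math_threads; // threads doing LDS reads + MFMA + epilogue

    // Block size used for MFMA wave assignment (math threads only).
    constexpr int block_size() const { return tile_math_threads; }
    // Block size used for the actual kernel launch (load + math threads).
    constexpr int launch_block_size() const { return tile_load_threads + tile_math_threads; }
    constexpr int total_waves() const { return launch_block_size() / kWaveSize; }
};

constexpr WaveletConfigSketch ratio_4_4{256, 256};
constexpr WaveletConfigSketch ratio_4_2{256, 128};

static_assert(ratio_4_4.launch_block_size() == 512 && ratio_4_4.total_waves() == 8);
static_assert(ratio_4_2.launch_block_size() == 384 && ratio_4_2.total_waves() == 6);
```

The corrected `BlockSize` matters because the earlier `max(TileLoad, TileMath)` value only coincides with the math thread count when the math group is at least as large as the load group; for the (4,2) ratio it would give 256 instead of 128.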
JH-Leon-KIM-AMD	9a5d1ea791	[rocm-libraries] ROCm/rocm-libraries#6208 (commit 33424f6) [CK] Enable grouped conv bwd data to match non-grouped perf via NoShuffle + packed descriptors (#6208) ## Motivation Improve performance of grouped convolution backward-data kernels to match non-grouped kernel performance for G=1 cases. ## Technical Details - Add NoShuffle epilogue path (direct VGPR→Global writes) by setting `CDEBlockTransferScalarPerVector_NPerBlock = 1` - Add nongrouped-match instances with optimized BBlockTransfer parameters for better thread utilization - Add packed (flat) descriptor path for G=1 2D convolutions, using simpler tensor descriptors with fewer transform layers to reduce address computation overhead in the GEMM main loop - Cherry-pick PR #6090 for fair benchmarking (cache flush, include dX zeroing cost) ## Test Plan - Benchmark grouped vs non-grouped kernels on MI300X (589 shapes, BF16) - Verify correctness with existing conv bwd data tests ## Test Result \| Metric \| Before \| After \| \|--------\|--------\|-------\| \| Mean ratio (grouped/nongrouped) \| 1.159 \| 1.028 \| \| Median ratio \| 1.142 \| 1.026 \| \| Cases within 2% \| 26 (4.4%) \| 186 (31.8%) \| \| Cases >20% slower \| 188 (32%) \| 2 (0.3%) \| NoShuffle + nongrouped-match instances achieve ~2.8% average gap with non-grouped kernels (down from ~16%). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: root <root@ctr-cx64-mi300x-4.amd.com> Co-authored-by: root <root@ctr-cx71-mi300x-01.amd.com> Co-authored-by: root <root@ctr-cx63-mi300x-21.amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: root <root@gt-ccs-aus-h17-18.cs-aus.dcgpu> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 06:49:50 -07:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Illia Silin	22b9feb40f	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings (#7111) ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 07:14:14 -07:00
jakpiase	fc39a02cda	[rocm-libraries] ROCm/rocm-libraries#6624 (commit 47d0162) [CK_TILE] Grouped Convolution Backward Data Direct Load (#6624) ## Proposed changes Add Grouped Convolution Backward Data with Direct Load into DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation. This enables direct global memory loading (bypassing LDS) for the backward data convolution path on gfx950, following the same pattern used in both backward weight and forward convolution. Direct load convolution backward data improves performance by avoiding LDS round-trips for certain configurations on gfx950, which supports a wider range of instructions. Currently correctness is checked only at usage point, but should be extended to a standalone UT in the future.	2026-04-23 11:16:55 +02:00
Illia Silin	d16061f578	[rocm-libraries] ROCm/rocm-libraries#6550 (commit c396de9) [CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
Ville Pietilä	c7fe8b72c6	[rocm-libraries] ROCm/rocm-libraries#6421 (commit 05b0753) [MIOpen][CK] Fix bwd weight conv test failures by disabling one block-GEMM V5 instance for 3D convs (#6421) ## Motivation Due to compiler version update, there are test failures in the test target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current develop branch, the test failures are blocking any PR merges into develop. An example of a failed CI runs is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342 as the tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. First attempt to fix this problem had to be reverted in #6400 because it broke MIOpen internal DB sync tests. ## Technical Details The root cause for the test failures are the block-GEMM V5 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have large tile size. The V5 pipeline uses double register buffer that in combination with large tile size causes high register pressure. The latest version of compiler handles the register spillage incorrectly for `gfx90a`, which cause the kernel to output incorrect results. The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that do not use direct load for are divided into two groups - Base instances - Instances that result into high register usage (currently only one instance - one that causes the test failures). This division allows to disable only the V5 block-GEMM flavor of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. 
The selective disabling leaves the set of instances for 1D and 2D convolutions unaffected, and removes at runtime two V5 block-GEMM instances (`ConvBwdWeightDefault` and `ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the device is `gfx90a`. Because MIOpen uses CK's type string (provided by method `GetTypeString`) to identify the instances, the DB sync tests are expected to be unaffected, since the V2 block-GEMM instances that produce the same type string (`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>`) remain. This expectation needs to be verified by running the MIOpen DB sync tests that are not part of the normal CK PR build. ## Test Plan Running all CI tests + the MIOpen internal DB sync tests is sufficient to verify the correctness of the code changes. ## Test Result Verified locally that the previously failing tests `TestGroupedConvndBwdWeight3d/4.Test3D` and `TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts - 231 on `gfx90a` - 233 on `gfx942` and are currently passing. This confirms the expectation that two instances per data type should be disabled on `gfx90a`. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Ville Pietilä <>	2026-04-17 09:16:32 +03:00
Brock Hargreaves	a4e36c1b89	[rocm-libraries] ROCm/rocm-libraries#6400 (commit c0b3c95) [MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400) Reverts ROCm/rocm-libraries#6343 This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.	2026-04-13 20:46:07 -06:00
Ville Pietilä	0f2279920b	[rocm-libraries] ROCm/rocm-libraries#6343 (commit 3604475) [CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343) ## Motivation Due to compiler version update, there are test failures in the test suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current `develop` branch, the test failures are blocking any PR merges into `develop`. An example of a failed CI runs is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342 as tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. ## Technical Details This PR disables the compilation of the problematic bwd weight conv instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that allows us to detect when we are compiling for `gfx90a`. Using the new CMake flag, compilation of instance `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`. Co-authored-by: Ville Pietilä <>	2026-04-13 13:40:27 +02:00
Aviral Goel	c7eb33078c	[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 11:17:11 -04:00
JP-Fernando	c31cf4fb4b	[rocm-libraries] ROCm/rocm-libraries#4591 (commit d34e981) [CK] Add BF16^3 support to grouped conv bwd weight: bilinear and scale (#4591) ## Motivation Until now, XDL grouped conv bwd weight for bilinear and scale only supported bf16f32bf16. This PR adds bf16bf16bf16 support. ## Technical Details Instances were added to the relevant files in the `library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_bwd_weight/` folder. In addition, `add()` functions were included in new files in the `library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/xdl/` and `library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_scale/xdl/` folders. The new .cpp files were also added to the `CMakeLists.txt` files of both folders. ## Test Plan Execute `grouped_convnd_bwd_weight` tests to check execution on different architectures. The tests for bilinear and scale already include the tuple `std::tuple<ck::half_t, ck::half_t, ck::half_t, ck::Number<3>>`, so in principle, there is nothing to modify in the tests themselves. ## Test Result `gfx1201`: Tests passed. `gfx1100`: Tests passed. `gfx90a`: Tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-03-11 13:05:44 +01:00
Christopher Millette	70360dc74f	[rocm-libraries] ROCm/rocm-libraries#5030 (commit 8e02a26) [CK] Replace tuple value construction with tuple_element_t type extraction [1A] (#5030) ## Summary ### Rationale CK's device operation instance registration uses `add_device_operation_instances` at ~1,850 call sites to register GPU kernel configurations. The existing implementation constructs `std::tuple` values just to extract their types via `decltype`, then copy-constructs each instance into `make_unique`. This is wasteful — only the types matter, not the values — and forces the compiler to instantiate the full `std::tuple` constructor and `std::get` machinery at every call site. ### What changed - Replace `remove_cvref_t<decltype(std::get<i>(tuple_obj))>` with `std::tuple_element_t<i.value, TupleType>`, which extracts the type directly without constructing any values - Replace copy-from-default `make_unique<T>(value)` with direct default construction `make_unique<T>()` — all CK device operation instances are stateless structs with configuration encoded in template parameters - Add `static_assert(std::is_default_constructible_v<NewOpInstance>)` to enforce this contract at compile time with a clear error message - Add Doxygen documentation for this high-traffic public API ### Value - Eliminates unnecessary template instantiation of `std::tuple` constructors and `std::get` across ~1,850 call sites - Establishes a cleaner, more intention-revealing pattern for type-only tuple usage - The `static_assert` prevents silent breakage if a non-default-constructible type is ever added - No runtime behavior change — zero risk ### Files changed (9) - `add_device_operation_instance.hpp`: Core pattern change - 3 example files, 3 reduce instance headers, 1 convolution header, 1 profiler header ## Test plan - [ ] Existing CI tests cover all ~1,850 call sites (GEMM, reduce, softmax, convolution) - [ ] `static_assert` provides compile-time validation stronger than runtime tests - [ ] No 
runtime behavior change — stateless struct default construction is identical to copy-from-default - [ ] Compatible with both `std::tuple` and `ck::type_list` containers 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-03-06 09:27:27 -07:00
Bartłomiej Kocot	a67aaa1b96	[rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2) [CK] Port non-grouped convolution instances to the grouped kernels (#4875) ## Motivation Port non-grouped convolution instances to the grouped kernels to deprecate the older non-grouped implementations. ## Technical Details Add the same instances as non-grouped but using the grouped kernel. ## Test Plan test_grouped_convnd_fwd ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-724	2026-02-28 01:24:30 +00:00
Yung-sheng Tu	743552b6fd	[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f) Implement device_grouped_gemm_fixed_nk_bias for RDNA4 (#4340) ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-26 00:28:09 +00:00
Zoltán Lakatos	cb60fdd58d	[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f) [CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425) ## Motivation Add support for grouped gemm multi ABD fixed NK. MR ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. As mentioned, this feature apparently had no tests for xdl, so our assumption is that either there is an implementation bug or these instances were not set up correctly. This could warrant a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-25 05:16:07 +00:00
assistant-librarian[bot]	263288a383	[rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49) 173 implement device grouped gemm fixed nk for rdna4 (#4299) ## Proposed changes This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk instance library using WMMA. The implementation is based on the existing DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level structure, but replaces the XDL kernel with a WMMA-based one. It uses the GridwiseGemm_wmma_cshuffle_v3 kernel. At this stage, the focus is functional correctness and compatibility, not performance tuning. ## Technical Details - Device struct for grouped gemm fixed NK - Example code for the WMMA version - Unit tests for both new wmma implementation and the reference XDL code (previously missing) - Generic ck profiler interface with the purpose of calling unit tests. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. 
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3668](https://github.com/ROCm/composable_kernel/pull/3668) 🧑‍💻 Originally authored by @bidlekm --------- Co-authored-by: Marton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: bidlekm <bidlekmarton@gmail.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-19 09:13:05 +01:00
Ville Pietilä	6b9df93342	[rocm-libraries] ROCm/rocm-libraries#4652 (commit 39a5a53) Revert "[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size." (#4652) PR ROCm/rocm-libraries#4275 contains CK fwd conv instances optimized for `gfx950` and they do not compile for other architectures such as `gfx940`. To ensure that the optimized instances are compiled only for `gfx950`, compile-time guard `#if defined(CK_USE_GFX950)` was used. This approach works correctly when we compile for a single architecture, but when we compile simultaneously for multiple architectures, flag `CK_USE_GFX950` is set for non-gfx950 archs as well. As a result, the multi-arch compilation fails. The problem doesn't appear in the ROCm libraries CI/CD pipeline since only one architecture is compiled at a time. Hence, the CI/CD passed for the original PR. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-18 17:02:13 -07:00
assistant-librarian[bot]	788d66025d	[rocm-libraries] ROCm/rocm-libraries#4275 (commit 2e07a39) [CK] Add new fwd conv fp16/bf16 instances optimized for unit group size. (#4275) ## Proposed changes Added new FP16/BF16 instances that are optimized for group size = 1. The new instances use the compute-optimized block GEMM pipeline. \| CK prof command \| Baseline (TFLOPs) \| New V3 instances (TFLOPs) \| \|:-----\|:------:\|------:\| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1 \| 858.818 \| 962.293 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 256 256 3 3 100 100 1 1 1 1 1 1 1 1 \| 979.987 \| 1121.11 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 50 50 1 1 1 1 1 1 1 1 \| 945.951 \| 1091.66 \| --- 🔁 Imported from [ROCm/composable_kernel#3670](https://github.com/ROCm/composable_kernel/pull/3670) 🧑‍💻 Originally authored by @vpietila-amd --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-02-17 17:58:11 -07:00
Bartłomiej Kocot	ea4942cd02	[rocm-libraries] ROCm/rocm-libraries#4506 (commit d9ccef7) Revert "[CK Conv] Add bwd weight instance for large-k shape" (#4506) Reverts ROCm/rocm-libraries#4266 due to CI failures. Should be investigated by @johannes-graner	2026-02-11 21:37:50 +00:00
Johannes Graner	40cec769ce	[rocm-libraries] ROCm/rocm-libraries#4266 (commit 1d8094d) [CK Conv] Add bwd weight instance for large-k shape ## Proposed changes This instance reduces the runtime of the shape used in `./bin/ckProfiler grouped_conv_bwd_weight 1 2 0 2 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1 all` from 10.3 ms to 6.6 ms. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered	2026-02-10 16:58:04 +00:00
Ville Pietilä	57d26db844	[rocm-libraries] ROCm/rocm-libraries#4273 (commit 591f504) [CK] Add fwd conv group merging to v3 conv instances ## Proposed changes Added conv group merging to the (universal) V3 fwd conv pipeline. The new instance improves fwd conv performance when the number of input/output channels per group is low. On MI300 (`gfx942`) we get \| CK prof command \| Baseline (TFLOPS) \| V3 group merging (TFLOPS) \| \|:-----\|:------:\|------:\| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 \| 3.86035 \| 8.36796 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 \| 10.1867 \| 13.4677 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 \| 11.7875 \| 16.3657 \|	2026-02-08 11:35:56 +00:00
Illia Silin	569640dc70	Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)" (#3705) This reverts commit `301eb5cf08`.	2026-02-03 09:52:14 -08:00
Zoltán Lakatos	301eb5cf08	Implement device grouped gemm fixed nk multi abd for rdna4 (#3619) * device struct implementation * added xdl grouped multi abd fixed nk testing * wmma implementation fixed * avoid unnecessary device mem allocation and code cleanups * cleanup instances definitions * wmma examples added * code cleanups * fix clang format * typo and compilation fixes related to reference gemm * fix compilation error due to std::remove_cvref_t * added missing hip_check_error includes * correction to example instances * review comments addressed * removed split-k from testing * code formatting --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-02 13:58:11 -08:00
Kiefer van Teutem	2377a62837	Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529) * Adding remaining flavors for grouped conv fwd As titled. Following variants are added: - grouped_conv2d_fwd_dynamic_op - grouped_conv3d_fwd_dynamic_op - grouped_conv3d_fwd_bilinear - grouped_conv3d_fwd_convscale - grouped_conv3d_fwd_convinvscale - grouped_conv3d_fwd_convscale_add - grouped_conv3d_fwd_convscale_relu - grouped_conv3d_fwd_scale - grouped_conv3d_fwd_combconvscale - grouped_conv3d_fwd_scaleadd_scaleadd_relu * Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3. * Do not build f8 / bf8 only flavor tests on RDNA3 * Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time. * Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream. * adding int8 and fp16 overloads to the elementwise operations * fixed copilot nits * Addressing review comments: - removed unnecessary examples for dynamic op - removed unnecessary conv specializations for all the flavors - removed spurious bilinear and scale source files * clang-format * reduced no of tests --------- Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>	2026-01-30 17:02:14 +01:00
Bartłomiej Kocot	83b58bb0c3	Grouped Conv Bwd Weight Direct Load (#3648) * Grouped Conv Bwd Weight Direct Load * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp * Implement group merging for bwd_weight and add instances * Link direct load instances * builder fixes * fix * fixes * fix --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com>	2026-01-28 15:31:54 -06:00
Johannes Graner	c190d8d61f	[CK tests] Extend conv GPU reference (#3539 ) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible	2026-01-27 09:49:42 +01:00
yinglu	8942a19d5e	ck: add CK_USE_GFX950 macro (#3636)	2026-01-26 11:38:45 -08:00
Ville Pietilä	7ac3794284	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639) Co-authored-by: Ville Pietilä <>	2026-01-25 13:42:23 +01:00
Wojciech Laskowski	2e08a7e5ab	WMMA grouped conv fwd large tensor bias bnorm clamp (#3595) * Added bias_bnorm_clamp for WMMA conv fwd large tensor. Following operations are added for FP16/BF16 data type and NHWGCxGKYXC layout. - grouped_conv2d_fwd_bias_bnorm_clamp - grouped_conv3d_fwd_bias_bnorm_clamp * changed strategy to handle GemmArgs array * Adding generic instance * fixed last nits from reviewers and copilot	2026-01-23 12:20:00 +01:00
Wojciech Laskowski	81ee19bd2c	WMMA grouped conv fwd large tensor extra flavors (#3582) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accommodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops	2026-01-23 12:19:51 +01:00
Bartłomiej Kocot	7b3db1a878	Grouped conv fwd direct load vector=2 (#3632)	2026-01-23 10:29:59 +01:00
ApoorvaKalyani	8daf6ea302	Grouped conv_fwd_bias_bnorm_clamp instances and tests (#3525 ) * Added bias_bnorm_clamp instances. * fwd_bias_bnorm_clamp comp instances * fwd_bias_bnorm_mem_inter and mem_intra instances * fwd_bias_bnorm_merged_group_instances * fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances * Device level changes for fwd_bias_bnorm_clamp * Added the test to the regression test list. * Removed the part 2 and 2x instances * Removed the irrelevant checks in wmma * Refactored the instances to adapt to new device implementation * Updated the reference and include files * enabling tests * Added missing profiler * Added missing instance entry , deleted by mistake * Reduce bias bnorm clamp instances to only a single generic one. * Clean up cmakelists file * clang-format * Change bias bnorm clamp tests to use monotone initialization values to avoid tiny off-integer gemm results on RDNA3 from blowing up. * Renaming some instance lists and add functions to be more standardized. * Commented out non default instances. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2026-01-22 09:53:59 +01:00
Erwin Terpstra	d5ae81b292	Implement batched gemm add relu gemm add for rdna4 (#3391 ) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multipled * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input 
values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2026-01-20 13:06:59 -08:00
Estevan Vedovelli	7d8bca7ddc	Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598) * Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions Enables hipTensor to access the WMMA HW functionalities for these combinations of datatype on gfx11 and gfx12. * Fix change to contraction scale tests * Fix clang-format	2026-01-20 09:39:57 -08:00
Erwin Terpstra	fe40a5d139	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method	2026-01-17 08:30:27 +01:00
Yung-sheng Tu	6df2d70143	Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429) * add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples	2026-01-15 07:19:31 -08:00
Enrico Degregori	693ff3bbb3	Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465) - Add support for direct store in epilogue instead of cshuffle - Add padding support for wave transfer without transpose - Add wave transfer with interleaved layout to support direct store - Enable new functionalities on GEMMs - Add optional new functionality support for grouped convolution fwd - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)	2026-01-14 11:02:19 +01:00
Erwin Terpstra	eb041079a3	Implement grouped gemm tile loop for RDNA4 (#3304) * feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3	2026-01-13 07:14:23 +01:00
Enrico Degregori	aad4cf0985	Wmma support for gemm_bias_add_reduce (#3316 ) * Add tests for gemm_bias_add_reduce * Initial working implementation * Generalize implementation of reduce epilogue * Add tests for all layouts * Add instances * Fix test archs * Fix xdl bug * Remove library/profiler duplications * Fix num_byted error profiler * Fix typos * Fix copyright	2026-01-07 10:27:16 -08:00
Erwin Terpstra	f9c6ba0403	Implement grouped gemm fastgelu for RDNA4 (#3303) * Implement grouped gemm fastgelu for RDNA4 * chore: some cleanup and minor inconsistencies in grouped gemm profiler * chore: clarified logic and reporting of supported instance warnings	2026-01-07 10:20:44 -08:00
ApoorvaKalyani	53a1e4f551	Grouped convolution backward data WMMA v3 implementation (#3460 ) * Added device level implementation for bwd_data_wmma_v3. * Added first instance of bwd_data_wmma_v3(f16). * Add support for bwd data in gridwise implementation Some changes are general for convolution and some are specific for bwd data. We need to generalize them once we have fwd, bwd data and bwd weight * Initial device implementation of bwd data * Remove unused template parameters in device impl * Add one instance for different layout initial check of device implementation * Add tests for splitk and for different layouts * Appended more instances to wmma_v3_f16. * Added conv_2d bf16 wmma_v3 instances. * Added conv_3d_bf16 wmma_v3_instances. * Added conv_3d_f16_wmma_v3_instances. * Added SplitN test cases for wmma. * Conv3d_bwd_data_scale_wmma_v3 instances. * Conv3d_bwd_data_bilinear_wmma_v3_instances * Renaming the device level instances file to common name , since it is defined for different DataTypes. * Renaming the instances and fixing typo * Added the test cases to regression test list * NCHW support for wmma_v3 * Examples for bf16 and f16 bwd_data_wmma_v3 * Added transpose conditons for device impl * fixing bugs * Added the gemm_args array implmentation * WIP debug conv bwd * fix splitk * Grouped gemm fix * Update CmakeLists with EOF * Added more instances for tests * Fixed the run time error in examples and removed 3d conv examples. * Fixed a typo. * Updated CmakeLists to removed the 3d convultion deleted files * Added print error statements for unsupoorted argument * Added the merge conflict related changes * Fixed compilation error * Fixed the InstanceFactory duplication error. * Removed the print statements and added logs to Arg function * All the merge conflict related errors resolved * Added d_tensor tests. 
* Added the missing example types of wmm_v3 * Merge error fix * Corrected the instance name * Reverted the bias relu change * Revereted the transpose load local change * Updated the regression test list with bwd_data_scale * Revert "Revereted the transpose load local change" This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b. * Revert "Merge error fix" This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b. * Reverting the local change * Added merge error fix * Build error fix due to merge conflicts * Added bias_relu example for wmma_v3 * Modified the main method in dtensor tests * Updated the dtensor tests to pick all the shapes * Updated the dtensor test shapes. * Updated the mem operations in tests. * Added reference func * Fixed typos in device impl * Added new header file and modified the include file for 3d tests * Renamed the test file and added reference func call. * clang format fix * Added ignore params * Modified device impl and tests * Removed debug print statements and updated dtensor test shapes * Fixing merge conflicts * Fixing more merge conflicts * Fixed copyrights * Updated the tuned instances to bilinear and scale. * Adding tuned instances to vanilla wmma_v3 * Removed all unused instances and modified test layouts. * Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances. * Fix clang format * Updated tuned f16/-genric instances * Formatting the instances file * Fixed copyrights and clang issues * Nonsense commit to force git to force * Removed the transpose instances * Added verified genric instances * Fixing namespace errors * Added todo for failing shapes * Formatting instance file * Fix instance list formatting * Removing unnecessary formats * Renamed the common file * Unification of xdl and wmma bwd_data tests * Updated Cmake * Added all layout types and deleted code. * Updated Cmake to add the condition to all tests. 
--------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2025-12-30 16:25:08 +01:00
Kiefer van Teutem	88ae445580	Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470) * Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests. * Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong.	2025-12-29 12:58:29 +01:00

451 Commits