composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 00:39:02 +00:00

Author	SHA1	Message	Date
Yi DING	01bd52bdb5	[rocm-libraries] ROCm/rocm-libraries#7925 (commit a8f0845) [CK] Fix gfx950 AITER Sync Regressions ## Summary Fixes three gfx950 regressions in the AITER downstream CI that surfaced after the internal/gfx1250 re-sync (ROCm/rocm-libraries#6978): > Companion aiter PR: ROCm/aiter#3392 — host-side adaptations (`Kernel::BlockSize()` `constexpr` drops, blockscale `KBatch=1` clamp) plus the CK submodule bump used to validate these fixes together. - FlyDSL MoE AOT cache miss — the AITER MoE tests run with `check_aot_cache=True` and fail on any FlyDSL JIT cache miss, but the CI never pre-compiles the FlyDSL MoE kernels, so gfx950 always misses. Pre-compile them at the start of the AITER test stage. - `buffer.load.lds.v4i32` link error — ROCm/rocm-libraries#6978 reintroduced a clang-version guard mapping `llvm.amdgcn.raw.buffer.load.lds` to a `.v4i32`-suffixed name. That name exists in no LLVM (the rsrc operand is a fixed, non-overloaded `<4 x i32>`, so the intrinsic is never type-mangled), so gfx950 4-DWORD direct-to-LDS (e.g. fp4 MoE bpreshuffle) fails to link with `lld: undefined symbol: llvm.amdgcn.raw.buffer.load.lds.v4i32`. Use the canonical plain name unconditionally. - mixed-precision flatmm warp-GEMM call — ROCm/rocm-libraries#6978 generalized the scaled `WarpGemmImpl::operator()` from a fixed `<index_t opselA, index_t opselB>` signature to a variadic `<typename... Params>` one and updated the `mx_flatmm` pipeline to pass the op-selectors as `OpSelA<>`/`OpSelB<>` types, but missed the mixed-precision flatmm pipeline (`F8xMXF4`/`F16xMXF4`), which still passed raw integer op-selectors. These no longer bind to `typename... Params` (`error: no matching member function for call to 'operator()'`), breaking compilation of the fp8/bf16 × fp4 cktile MoE gemm1 instances on gfx950 (aiter `test_moe_2stage`). Wrap the op-selectors in `OpSelA<>`/`OpSelB<>`. 
## Changes - `Jenkinsfile`: pre-compile the FlyDSL MoE AOT cache (`python3 aiter/aot/flydsl/moe.py`) before the AITER tests. - `include/ck/utility/amd_buffer_addressing_builtins.hpp` and `include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp`: drop the `__clang_major__` guard and always use `__asm("llvm.amdgcn.raw.buffer.load.lds")`. The plain name is the canonical one for all sizes including the gfx950 16-byte form, as the upstream LLVM gfx950 tests confirm. - `include/ck_tile/ops/flatmm/pipeline/mixed_prec_flatmm_pipeline_agmem_bgmem_creg_v1.hpp`: wrap the warp-GEMM op-selectors in `OpSelA<>`/`OpSelB<>` at the five call sites, matching the `mx_flatmm` pipeline. ## Test plan Validated via CI.	2026-06-03 02:09:05 +00:00
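The warp-GEMM binding failure described in the commit above can be reproduced in isolation. The following is a minimal sketch, not the CK implementation: `WarpGemmSketch`, the `value` member, and the summing body are hypothetical stand-ins, and only the names `OpSelA`/`OpSelB` and the `<typename... Params>` shape are taken from the commit message.

```cpp
#include <cassert>

using index_t = int; // stand-in for the library's index type (assumption)

// Hypothetical tag types that carry an op-selector value in the type system.
template <index_t V>
struct OpSelA { static constexpr index_t value = V; };
template <index_t V>
struct OpSelB { static constexpr index_t value = V; };

struct WarpGemmSketch
{
    // A type parameter pack accepts only types. A call such as
    // gemm.operator()<0, 1>() therefore finds no matching member function,
    // which is the failure mode quoted in the commit message.
    template <typename... Params>
    constexpr index_t operator()() const
    {
        // Sum the selector values to show they stay recoverable from the tags.
        index_t sum = 0;
        const index_t values[] = {Params::value..., 0};
        for(index_t v : values)
            sum += v;
        return sum;
    }
};

constexpr index_t call_with_wrapped_selectors()
{
    // Wrapping the integers in tag types lets them bind to the type pack.
    return WarpGemmSketch{}.operator()<OpSelA<1>, OpSelB<2>>();
}

static_assert(call_with_wrapped_selectors() == 3, "tag types bind to the type pack");
```

Replacing the template arguments in `call_with_wrapped_selectors` with raw integers should make the translation unit fail to compile, mirroring the regression.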
Maksim (Max) Podkorytov	d574cc4757	[rocm-libraries] ROCm/rocm-libraries#6696 (commit 9627b91) Replace nested static_for lambdas with compile-time search helper (#6696) ## Summary - Add `sequence_find_value` and `find_in_tuple_of_sequences` compile-time search helpers with O(1) template depth - Replace nested `static_for` lambdas in `TensorDescriptor::GetTransformAndItsUpperDimension` and `InitializeElementSize` - Apply same optimizations to `TensorAdaptor` Supersedes #4287. Conflict-resolved rebase of ROCm/composable_kernel#3600 onto current develop. ## Motivation The `TensorDescriptor` and `TensorAdaptor` classes had excessive template instantiation from: 1. Nested `static_for` loops with lambdas creating unique closure types at every call site 2. `generate_tuple` with lambdas causing per-type instantiation overhead The new helpers use constexpr array lookup and pack expansion instead of recursive template patterns, achieving O(1) template depth. ## Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved, `-j1`, `-ftime-trace`)

| TU | Baseline (mean) | New (mean) | Delta | Wilcoxon p | Mann-Whitney p |
|----|-----------------|------------|-------|------------|----------------|
| `grouped_conv_fwd_xdl_fp16` (host) | 14,886 ms | 13,353 ms | -10.3% | 0.002 | 0.0002 |
| `grouped_conv_fwd_xdl_fp16` (device) | 27,762 ms | 25,629 ms | -7.7% | 0.002 | 0.0002 |
| Total (all TUs) | 57,732 ms | 54,030 ms | -6.4% | | |

Unrelated TUs (`device_memory`, `host_tensor`, `convolution_parameter`) show no significant difference (p > 0.3), serving as negative controls. ### Methodology - 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) 
on the same node to eliminate ordering/warmup bias - Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U test (unpaired) - Built with patched clang (LLVM 22) on ctr2-alola-compile-11, `-j1` for accurate per-TU timing - Raw data available in Slurm job 275230 results ## Test plan - [x] 11 unit tests added (5 for `sequence_find_value`, 6 for `find_in_tuple_of_sequences`) - [x] Compile-time benchmark with statistical significance (p < 0.01) - [ ] Full CI Tracking issue: #4229	2026-06-02 23:15:10 +00:00
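The "constexpr array lookup and pack expansion" idea from the commit above can be sketched as follows. This is an illustration under assumptions: the `sequence` type here is a bare hypothetical stand-in, and `sequence_find_value_sketch` with its "index or -1" contract is invented for the example; the real helper's signature is not shown in the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal sequence type; the library's own type is richer.
template <int... Is>
struct sequence {};

// Return the position of `target` in the sequence, or -1 when it is absent.
// The pack is expanded once into a constexpr array and scanned with a plain
// loop, so template instantiation depth does not grow with sequence length.
template <int... Is>
constexpr int sequence_find_value_sketch(sequence<Is...>, int target)
{
    constexpr int values[] = {Is..., 0}; // trailing 0 keeps the array non-empty
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
    {
        if(values[i] == target)
            return static_cast<int>(i);
    }
    return -1;
}

static_assert(sequence_find_value_sketch(sequence<4, 7, 9>{}, 9) == 2, "found at index 2");
static_assert(sequence_find_value_sketch(sequence<>{}, 1) == -1, "empty sequence");
```

A recursive formulation would instantiate one template per element; here a single instantiation covers the whole search, which is the property the commit credits for the shorter compile times.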
Sami Remes	919096fde8	[rocm-libraries] ROCm/rocm-libraries#7935 (commit 5c96097) [CK] Allow skipping split-K C-buffer zero-init in xdl_cshuffle blockscale GEMM (#7935) Add a `skip_zero_init` flag (default false) to the Problem/Argument of the xdl_cshuffle block-scale GEMM device ops (multiple_d ab_scale and blockscale b-preshuffle). When the flag is set, the device invoker skips the internal hipMemsetAsync that zeroes p_c_grid before the KBatch > 1 split-K atomic-accumulation path. The flag is declared on the gridwise Problem struct (inherited by Argument), so it is visible on both the rotating-cache (arg_) and the normal (arg) launch paths in each device op. Why: callers that already pre-zero the output buffer otherwise pay for a redundant device-wide memset before split-K atomic accumulation. Gating the memset behind an opt-in flag lets such callers avoid the duplicate work. Because the flag defaults to false, every existing call site is unaffected and the observable behavior is unchanged. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-02 13:08:46 +00:00
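The gating logic of the `skip_zero_init` commit can be illustrated on the host. This is a sketch under assumptions: `SplitKArgumentSketch` and `zero_c_if_needed` are hypothetical names, the output buffer is a `std::vector` rather than device memory, and `std::fill` substitutes for the `hipMemsetAsync` call the commit mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for the device op's Argument. Only the
// `skip_zero_init` name and the KBatch concept come from the commit message.
struct SplitKArgumentSketch
{
    std::vector<float>* c_buffer = nullptr; // stands in for p_c_grid
    int k_batch                  = 1;
    bool skip_zero_init          = false; // default preserves existing behavior
};

// Clear C only on the split-K path (k_batch > 1) and only when the caller
// has not opted out. Returns true when the clear was performed.
inline bool zero_c_if_needed(const SplitKArgumentSketch& arg)
{
    if(arg.k_batch > 1 && !arg.skip_zero_init)
    {
        // Host substitute for the device-wide memset described in the commit.
        std::fill(arg.c_buffer->begin(), arg.c_buffer->end(), 0.0f);
        return true;
    }
    return false;
}
```

A caller that has already zeroed the output would set `skip_zero_init = true` before invoking; every other caller keeps the default and sees the same clearing as before.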
Bartłomiej Kocot	5d912538d3	[rocm-libraries] ROCm/rocm-libraries#7847 (commit b995ef2) [CK] Remove IsPackedTensor function ## Motivation Fix codegen hipRTC. ## Technical Details Remove the unneeded function, since MakeArgument supports long_index_t strides. ## Test Plan Codegen tests. ## Test Result Passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-29 14:00:06 +00:00
Zoltán Lakatos	58e2ab1fc7	[rocm-libraries] ROCm/rocm-libraries#6761 (commit d19f6f1) [CK] Large tensor gemm workaround (#6761) ## Motivation A customer requested large tensor gemm support for 8-bit and 4-bit data types. Currently CK triggers a “This GEMM not supported” error. The root cause appears to be the 2 GB limit on the input/output matrix, triggered by buffer offset constraints when testing a larger shape such as M = 699,904 (which is an exact multiple of MPerBlock = 256). ## Technical Details Quick workaround to have support ASAP. Split the tensors into inputs/outputs smaller than the 2 GB limit, then iterate on the host and call all subproblems without any device code change. Support is restricted to row-wise layout in A, Ds and E. All changes were implemented in the DeviceGemm structures to avoid secondary effects on grouped convolutions. Got lots of AI-generated comments; addressed the ones that seemed relevant to the functionality. ## Test Plan Within CK the following examples can be used with modified input sizes: example_gemm_multiply_multiply_xdl_fp8 example_gemm_mx_fp4 Tested with Aiter tuning on provided shapes. ## Test Result All gemms run and provide correct results. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-05-27 18:55:15 +00:00
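The host-side splitting described in the large tensor commit might look roughly like the sketch below. It rests on assumptions the commit does not state: that the split runs along M (plausible given the row-wise layout restriction), that chunk boundaries are kept on MPerBlock multiples, and that the row size in bytes is known. `MChunk` and `split_rows` are hypothetical names, and the 4096-byte row used in the usage note is an invented figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open row range [m_begin, m_begin + m_size) handled by one sub-GEMM.
struct MChunk
{
    std::int64_t m_begin;
    std::int64_t m_size;
};

// Split M rows of a row-major matrix into chunks whose byte size stays
// strictly below limit_bytes, with every chunk start aligned to m_per_block.
// Returns an empty vector when even one block of rows exceeds the limit.
inline std::vector<MChunk> split_rows(std::int64_t M,
                                      std::int64_t row_bytes,
                                      std::int64_t m_per_block,
                                      std::int64_t limit_bytes)
{
    std::int64_t max_rows = (limit_bytes - 1) / row_bytes;
    max_rows -= max_rows % m_per_block; // round down to a block multiple

    std::vector<MChunk> chunks;
    if(max_rows <= 0)
        return chunks;

    for(std::int64_t m = 0; m < M; m += max_rows)
        chunks.push_back({m, std::min(max_rows, M - m)});
    return chunks;
}
```

With M = 699,904, MPerBlock = 256, an assumed 4096-byte row and a 2^31-byte limit, this yields two row ranges of 524,032 and 175,872 rows; the host would then invoke the unchanged device GEMM once per range with offset A and E pointers.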
Michael Halkenhäuser	5dc8fbd1a8	[rocm-libraries] ROCm/rocm-libraries#6900 (commit `28608c2`) [CK] Fix and expand CK's commit records in version.h (#6900) ## Motivation In `version.h` of a CK installation `CK_COMMIT_ID` would be empty for out-of-source builds. Additionally, if it worked, it would show the parent repo's (`rocm-libraries`) commit. ## Technical Details Dropped "required" constraint so "unknown" string becomes a graceful option. Changed process of determining the CK commit, now uses `WORKING_DIRECTORY`. Thus, `CK_COMMIT_ID` holds only the last CK-relevant commit. Added `CK_PARENT_COMMIT_ID` which holds the parent's, e.g. `rocm-libraries`, commit. This can be the same as `CK_COMMIT_ID`, or not even applicable, depending on the scenario. ## Test Plan Ran CMake configuration and installation of CK to verify happy path. ## Test Result Commit SHA's showed the expected values depending on the repo state. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 10:17:02 -07:00
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings from the latest staging compiler are not recognized by older compiler versions and trigger an avalanche of warnings. A previous attempt to suppress them with the -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that the current approach of checking the clang version works as intended and makes the warnings go away. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
Bartłomiej Kocot	60df79085d	[rocm-libraries] ROCm/rocm-libraries#7631 (commit d591a7c) [CK] Grouped Convolution Global Load/Store support (#7631) ## Motivation Grouped Convolution Global Load/Store support to cover large tensor cases. ## Technical Details Utilize global load for grouped convolution forward kernels. Update indexes to use int64. ## Test Plan - test utils - test conv kernels in next PR with instances ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1255 --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-27 08:21:54 +00:00
JH-Leon-KIM-AMD	00e1d82ae7	[rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9) [CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732) ## Motivation Fix silent miscompute in the grouped convolution backward-data kernel (`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride > dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast path that dropped all but the first sub-GEMM, producing zeroed slices of `dx` on the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness without giving up the perf gains PR #6208 delivered on stride=1 shapes. ## Technical Details - Tighten the flat-descriptor fast-path gate to require `arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its original purpose). For stride > 1, the implicit GEMM is split into `gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly; routing them through the flat path required dropping all but the first, which was the source of the bug. - Stride > 1 now falls through to the existing grouped CShuffle path, which packs all sub-GEMMs into one descriptor array and walks them on-device in a single kernel launch. This is the pre-PR-6208 production path; correctness is established and per-dispatch launch count is minimised. - Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0) intersection in `test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with `gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this intersection (all stride>1 cases used G=2; all G=1 cases used stride=1), which is why PR #6208's regression slipped past CI. ## Test Plan - `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke tier — runs on every PR via `smart_build_and_test.sh`). 
- End-to-end verify (`verify=1`) via `example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes including the original ALMIOPEN-1959 case and a cross-bucket (`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32` buckets. - ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate via an environment variable: full kernel-family enumeration, winning kernel + its avg_time reported under each gate. 33/41 shapes completed before the sweep was stopped; the remaining 8 were the largest i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape enumeration budget (not relevant to the verdict). ## Test Result ### Correctness

| Test | Result |
|---|:---:|
| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D, includes 3 new regression shapes) | 12/12 PASSED in 14.18 s |
| `test_grouped_convnd_bwd_data_interface` (API checks) | PASSED in 0.28 s |
| ALMIOPEN-1959 stride=2 (`verify=1`) | PASSED |
| stride=1 K3 (`verify=1`) | PASSED |
| stride=3 K3 `gemms_count=9` (`verify=1`) | PASSED |
| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) | PASSED |

### Performance (ckProfiler A/B on gfx942 / MI300X) Comparing the post-fix gate (flat path only when `gemms_count_==1`, column "B") vs the inner-loop variant that keeps the flat path on stride>1 (column "A") across 25 stride>1 shapes where production picks a `_v1` instance (so the gate actually fires):

| Stride | Shapes | A wins | Tie | B wins | Notes |
|:------:|:------:|:------:|:---:|:------:|---|
| 1 (sanity, gate moot) | 3 | 0 | 3 | 0 | gate doesn't differentiate — A == B as expected |
| > 1 (gate fires) | 25 | 0 | 11 | 14 | B wins +6% to +32%; A never wins |

Highlights from the firing-gate cases:

| Shape (G=1, stride=2 unless noted) | A ms | B ms | B vs A |
|---|---:|---:|---:|
| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) | 0.183 | 0.171 | B +6% |
| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) | 0.054 | 0.045 | B +17% |
| i2v-010 (N=1, K=C=384, 3×3, 277×209) | 0.174 | 0.125 | B +28% |
| Synthetic 50×50 K3 N=32 K=C=256 | 0.131 | 0.088 | B +32% |

Why B wins everywhere the gate fires: for `gemms_count = N`, the flat path needs N kernel launches (one per sub-GEMM), while the grouped path loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) × launch-tax is a structural disadvantage A can't recover from. ### Diff

| File | Lines |
|---|---:|
| `include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp` | +14 / −8 (one extra condition + expanded dispatch comment) |
| `test/.../test_grouped_convnd_bwd_data.cpp` | +9 / −0 (3 new shapes) |

## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 09:59:14 +03:00
chris-tsiaousis-hpc	c7fac341de	[rocm-libraries] ROCm/rocm-libraries#4871 (commit 7d4c040) [CK] Decouple EpilogueArgs from GridwiseGemm implementation (#4871) This is a duplicate of #4537. I could not re-open it since the target branch got deleted and could not change the target branch since it was closed... :) ## Motivation Right now, all the Epilogue structs are declared inside the base gridwise struct. They should be independent of it, and the specialization of the selected Epilogue Type should be declared within the kernel function. ## Technical Details All Epilogue structs depend on template parameters that are known to the base Gridwise Gemm struct. In this PR, we export them to be used independently by any struct that might need to extract them. This approach will serve the decoupling purposes for the Epilogues, but also enable future constructs to use and expand this approach. See 30e2a4c01b64bdea68857c7badd9d7cffbf1adb9. Right now an issue that arises is that when implementing a new Epilogue Type, the developer is not forced to decide where this struct should/can be used or not. To fix this I propose defining an `enum struct EpilogueType` that will be used to fetch the Epilogue specialization through a helper struct. See a943ac8d130e12d6843715b322181186e54ba15c. Note that all the instantiation details will stay in this helper struct. Also note the static assertion in the else statement. ## Test Plan Test with existing CI, as nothing is added/removed. ## Test Result All relevant existing CI tests should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-05-22 18:39:01 +00:00
Bartłomiej Kocot	ebb97044f4	[rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1) Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664) ## Motivation Incorrect results were introduced for some conv bwd cases. ## Technical Details This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3. ## Test Plan CI ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ALMIOPEN-1959	2026-05-22 12:28:49 +00:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
Johannes Graner	3727d5220a	[rocm-libraries] ROCm/rocm-libraries#5652 (commit 7dc7d1d) [CK Conv] Wavelet gemm pipeline for bwd_weight convolution (#5652) ## Motivation In the current CShuffleV3 backward weight kernel, the in-kernel conv-to-GEMM transform generates significant INT32 VALU pressure per MFMA instruction. On VALU-heavy shapes (e.g., G=1, 3×3, C=256), these index computation ops compete with MFMA for VALU issue slots, creating a bottleneck that cannot be resolved by pipeline prefetching alone. This PR adds a wave-specialized ("wavelet") convolution backward weight kernel that splits workgroup threads into two roles: - Load waves: conv-to-GEMM address computation + global memory loads + LDS writes (all VALU/VMEM) - Math waves: LDS reads + MFMA + CShuffle epilogue (no index computation) By physically separating the two instruction classes onto different waves, VALU and MFMA execute on different hardware functional units without contention. ## Technical Details Core kernel (new files): - `gridwise_gemm_xdl_waveletmodel_cshuffle_conv_v3.hpp` — wave-specialized gridwise GEMM for conv bwd weight (2-way split: load + math) - `device_grouped_conv_bwd_weight_xdl_waveletmodel_cshuffle_v3.hpp` — device op following CShuffleV3 patterns; `BlockSize = TileMathThreadGroupSize` for MFMA wave assignment, `LaunchBlockSize = TileLoad + TileMath` for kernel launch Wave pipeline (modified): - `gridwise_gemm_waveletmodel.hpp` — load/math wave pipeline structs with `sched_group_barrier` scheduling hints to front-load VMEM reads before address-advance VALU Two wave ratios: - (4,4): 256 load + 256 math = 512 threads (8 waves). Best on large shapes. - (4,2): 256 load + 128 math = 384 threads (6 waves). Best on small shapes (fewer sync barriers, denser MFMA per math wave). 
Instance coverage (F16 and BF16 symmetric):

| Ratio | Tiles | Layouts | ConvSpecs |
|-------|-------|---------|-----------|
| (4,4) | M128×N128, M64×N64, M128×N64, M64×N128 | 2D NHWGC, 3D NDHWGC | Default, Filter1x1Stride1Pad0 |
| (4,2) | M64×N64, M128×N64, M64×N128 | 2D NHWGC | Default, Filter1x1Stride1Pad0 |

Existing wavelet model fixes: - `BlockSize` corrected from `math::max(TileLoad, TileMath)` to `TileMathThreadGroupSize` in the flat-GEMM wavelet device op and gridwise kernel ## Test Plan - `test_grouped_convnd_bwd_weight` GTest: 34 hardcoded test cases covering 1D/2D/3D, F16/BF16, G=1/2/16, various spatial sizes - Performance benchmark: all 37 RetinaNet bwd_weight shapes on gfx950

```bash
ninja -C build test_grouped_convnd_bwd_weight
./build/bin/test_grouped_convnd_bwd_weight
```

## Test Result Correctness: 34/34 GTest cases passed (F16/BF16 × 1D/2D/3D × Default/Filter1x1Stride1Pad0 × various G/N/K/C combinations). Performance: Wavelet is the fastest overall instance on 12/37 RetinaNet shapes — all G=1, 3×3 convolutions with C=256 (the VALU-heavy target shapes):

| Shape | Uplift vs best baseline |
|-------|------------------------|
| K=36, 7×7 | 1.91x |
| K=36, 100×100 | 1.60x |
| K=36, 13×13 | 1.43x |
| K=36, 25×25 | 1.38x |
| K=36, 50×50 | 1.38x |
| K=256, 100×100 | 1.24x |
| K=256, 13×13, s=2 | 1.20x |
| K=256, 25×25, s=2 | 1.20x |
| K=256, 7×7 | 1.17x |
| K=256, 13×13 | 1.13x |
| K=2376, 50×50 | 1.05x |
| K=2376, 100×100 | 1.06x |

Where wavelet does not win (25/37): 1×1 convolutions (explicit kernel does host-side transform), grouped convolutions with small per-group channels, and shapes where standard CShuffleV3 already amortizes VALU overhead. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com>	2026-05-18 17:46:01 +02:00
JH-Leon-KIM-AMD	9a5d1ea791	[rocm-libraries] ROCm/rocm-libraries#6208 (commit 33424f6) [CK] Enable grouped conv bwd data to match non-grouped perf via NoShuffle + packed descriptors (#6208) ## Motivation Improve performance of grouped convolution backward-data kernels to match non-grouped kernel performance for G=1 cases. ## Technical Details - Add NoShuffle epilogue path (direct VGPR→Global writes) by setting `CDEBlockTransferScalarPerVector_NPerBlock = 1` - Add nongrouped-match instances with optimized BBlockTransfer parameters for better thread utilization - Add packed (flat) descriptor path for G=1 2D convolutions, using simpler tensor descriptors with fewer transform layers to reduce address computation overhead in the GEMM main loop - Cherry-pick PR #6090 for fair benchmarking (cache flush, include dX zeroing cost) ## Test Plan - Benchmark grouped vs non-grouped kernels on MI300X (589 shapes, BF16) - Verify correctness with existing conv bwd data tests ## Test Result

| Metric | Before | After |
|--------|--------|-------|
| Mean ratio (grouped/nongrouped) | 1.159 | 1.028 |
| Median ratio | 1.142 | 1.026 |
| Cases within 2% | 26 (4.4%) | 186 (31.8%) |
| Cases >20% slower | 188 (32%) | 2 (0.3%) |

NoShuffle + nongrouped-match instances achieve ~2.8% average gap with non-grouped kernels (down from ~16%). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: root <root@ctr-cx64-mi300x-4.amd.com> Co-authored-by: root <root@ctr-cx71-mi300x-01.amd.com> Co-authored-by: root <root@ctr-cx63-mi300x-21.amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: root <root@gt-ccs-aus-h17-18.cs-aus.dcgpu> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 06:49:50 -07:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Illia Silin	22b9feb40f	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings (#7111) ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 07:14:14 -07:00
Linjun-AMD	cb61576896	[rocm-libraries] ROCm/rocm-libraries#6873 (commit b61b3fb) [CK] add swiglustep_and_mul activation to gridwise_moe_gemm (#6873) Title： feat(composablekernel): add swiglustep_and_mul activation to gridwise_moe_gemm Description： ## Motivation Step-3.5-Flash uses a clamped SwiGLU activation (`swiglu_limits[43]=7`, `swiglu_limits[44]=7`) for layers 43 and 44. Without this kernel path, those layers produce BOS token spam because unclamped gate/up values accumulate floating-point noise over 200+ decode steps, degrading output quality (cosine similarity drops from 0.999989 to ~0.998982). ## Changes Add `swiglustep_and_mul` as a new `Activation` enum branch in `gridwise_moe_gemm.hpp`, covering all 4 code paths: - Quantized (A×B scale) + IsInputGemm=true - Quantized (A×B scale) + IsInputGemm=false - Non-quantized + IsInputGemm=true - Non-quantized + IsInputGemm=false The activation computes: gate = silu(gate) gate = clamp(gate, max=7.0f) up = clamp(up, min=-7.0f, max=7.0f) output = gate * up Also handles the `MulRoutedWeight` case (topk weight multiplication) and `pk_i4_t` weight scaling (×16 dequant factor). ## Verification - Tested on gfx950 (MI350X, 8×GPU) - cosine similarity for layers 43/44: 0.999989 (vs 0.998982 before fix) - End-to-end Step-3.5-Flash inference: no BOS spam, output coherent - BF16 tp=2/tp=4 and FP8 tp=2/tp=4 all verified PASS - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-07 05:59:47 +00:00
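The activation formula listed in the `swiglustep_and_mul` commit can be written out as a scalar reference. This is a sketch only: the function name `swiglustep_and_mul_sketch` and the `limit` parameter are hypothetical, and the routed-weight multiplication and `pk_i4_t` scaling mentioned in the commit are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Standard SiLU: x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Scalar reference for the four steps in the commit message:
//   gate = silu(gate); gate = clamp(gate, max=limit);
//   up = clamp(up, min=-limit, max=limit); output = gate * up
inline float swiglustep_and_mul_sketch(float gate, float up, float limit = 7.0f)
{
    gate = silu(gate);
    gate = std::min(gate, limit);               // upper clamp only
    up   = std::min(std::max(up, -limit), limit); // symmetric clamp
    return gate * up;
}
```

Large inputs saturate: with `gate = 100` and `up = 100` both operands clamp to 7, so the product is bounded by 49 in magnitude, which is the property the commit relies on to stop values from drifting over long decode runs.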
Artem Kuzmitckii	d13e674b49	[rocm-libraries] ROCm/rocm-libraries#6132 (commit e97065d) [CK] Fix divide-by-zero crash for grouped conv kernels (#6132) ## Motivation While running the PyTorch unit tests for conv3d (`test_dtypes_nn_functional_conv3d_cuda`, `test_fake_crossref_backward_amp_nn_functional_conv3d_cuda_float32`), a divide-by-zero crash was found during CK kernel selection. Refs ROCM-20764 ## Technical Details Add an assert for K0PerBlock equal to 0, and cover other potential places related to the k_batch calculation. ## Test Plan Run the MIOpen command extracted from the mentioned test: `MIOpenDriver convfp16 --spatial_dim 3 -I NCDHW -O NCDHW -f NCDHW -n 1 -c 1 -k 1 -g 1 --in_d 4 -H 4 -W 4 --fil_d 4 -y 4 -x 4 --pad_d 0 -p 0 -q 0 --conv_stride_d 2 -u 2 -v 2 --dilation_d 1 -l 1 -j 1 -m conv -F 4 -t 1` ## Test Result Passed ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>	2026-04-23 22:10:46 +02:00
KateJu	940c9603a3	[rocm-libraries] ROCm/rocm-libraries#6655 (commit 677b38d) Add missing lds sync (#6655) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-23 07:05:33 -07:00
jakpiase	fc39a02cda	[rocm-libraries] ROCm/rocm-libraries#6624 (commit 47d0162) [CK_TILE] Grouped Convolution Backward Data Direct Load (#6624) ## Proposed changes Add Grouped Convolution Backward Data with Direct Load into DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation. This enables direct global memory loading (bypassing LDS) for the backward data convolution path on gfx950, following the same pattern used in both backward weight and forward convolution. Direct load convolution backward data improves performance by avoiding LDS round-trips for certain configurations on gfx950, which supports a wider range of instructions. Currently correctness is checked only at usage point, but should be extended to a standalone UT in the future.	2026-04-23 11:16:55 +02:00
Illia Silin	d16061f578	[rocm-libraries] ROCm/rocm-libraries#6550 (commit c396de9) [CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
金黄色葡萄球君君	8be1bc3b1f	[rocm-libraries] ROCm/rocm-libraries#6118 (commit 2c7dcf7) projects/composablekernel: add SwigluStep support for MoE blockscale (#6118) ## Summary - add `swiglustep_and_mul` to the composablekernel MoE blockscale activation enum - implement the corresponding blockscale epilogue path for `SwigluStep` - keep existing `silu` and `gelu` paths unchanged ## Scope This PR covers the classic composablekernel blockscale MoE path under `projects/composablekernel`. This is separate from the `ck_tile` / FlatMM path being discussed in ROCm/rocm-libraries#5992. ## Motivation `Step-3.5-Flash-FP8` uses `SwigluStep` in its MoE MLP path. The dependent AITER change needs native support for this activation in the classic composablekernel MoE blockscale path. ## Validation - patch is limited to two composablekernel files under `projects/composablekernel` - existing `silu` / `gelu` paths are unchanged - dependent AITER runtime validation hit the classic CK 2-stage path with AITER MoE enabled	2026-04-21 07:24:48 +00:00
Zoltán Lakatos	09bf63fa71	[rocm-libraries] ROCm/rocm-libraries#4961 (commit 6c3969a) [CK] Remove code duplications in grouped gemm fixed nk implementations (#4961) ## Motivation Different flavours of grouped gemm fixed nk implementations share the same block-to-tile mapping logic. Despite that, the code responsible for it is duplicated in each device struct implementation. - Move `BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops` and `OffsettedBlockToCTileMapMLoops` from the device struct implementations to a common header file. - Use the generic Kernel Argument structures in xdl versions of the fixed nk. ## Test Plan CI in general. Relevant tests and examples are all fixed_nk versions of grouped gemm multiple D and ABD. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-04-20 12:24:59 +00:00
Ville Pietilä	c7fe8b72c6	[rocm-libraries] ROCm/rocm-libraries#6421 (commit 05b0753) [MIOpen][CK] Fix bwd weight conv test failures by disabling one block-GEMM V5 instance for 3D convs (#6421) ## Motivation Due to a compiler version update, there are test failures in the test target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current develop branch, the test failures are blocking any PR merges into develop. An example of a failed CI run is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342, as the tests pass with clang compiler version 20.0 and fail with clang compiler version 22.0. The first attempt to fix this problem had to be reverted in #6400 because it broke MIOpen internal DB sync tests. ## Technical Details The root cause of the test failures is the block-GEMM V5 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have a large tile size. The V5 pipeline uses a double register buffer that, in combination with the large tile size, causes high register pressure. The latest version of the compiler handles the register spillage incorrectly for `gfx90a`, which causes the kernel to output incorrect results. The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that do not use direct load are divided into two groups - Base instances - Instances that result in high register usage (currently only one instance - one that causes the test failures). This division allows disabling only the V5 block-GEMM flavor of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`.
The selective disabling leaves the set of instances for 1D and 2D convolutions unaffected, and removes at runtime two V5 block-GEMM instances (`ConvBwdWeightDefault` and `ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the device is `gfx90a`. Because MIOpen uses CK's type string (provided by method `GetTypeString`) to identify the instances, the DB sync tests are expected to be unaffected since there are still the V2 block-GEMM instances that result in the same type string (`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by running the MIOpen DB sync tests that are not part of the normal CK PR build. ## Test Plan Running all CI tests + the MIOpen internal DB sync tests is sufficient to verify the correctness of the code changes. ## Test Result Verified locally that the previously failing tests `TestGroupedConvndBwdWeight3d/4.Test3D` and `TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts - 231 on `gfx90a` - 233 on `gfx942` and are currently passing. This confirms the expectation that two instances per data type should be disabled on `gfx90a`. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Ville Pietilä <>	2026-04-17 09:16:32 +03:00
Estevan Vedovelli	0121f39b1f	[rocm-libraries] ROCm/rocm-libraries#6379 (commit b38b056) [ck] Clamp negative kernel execution elapsed time to zero (#6379) ## Motivation hipEventElapsedTime can return a small negative value on Windows when timing a very fast kernel launch on the null stream. This caused consumers of launch_and_time_kernel to receive a negative elapsed time, which they reasonably treat as an error, breaking otherwise-correct kernel executions. ## Technical Details After calling hipEventElapsedTime, a clamp is applied in launch_and_time_kernel before the result is returned, avoiding the return of a physically impossible elapsed time. The negative value from hipEventElapsedTime has been observed on Windows. For kernels that complete in well under a millisecond, the HIP event timestamps can alias such that the computed difference is a small negative number (observed: ~-1.78 ms). No HIP error is reported by any surrounding call (hipEventRecord, hipEventSynchronize, hipGetLastError), confirming the kernel itself executed successfully. ## Test Plan - Recompile CK and validate no kernel execution reports a negative elapsed time during hipTensor tests. - Pass the CI/CD pre-checking tests for CK. ## Test Result - All tests passing ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-14 09:14:26 -07:00
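The clamp described in #6379 can be sketched as below. This is a minimal illustration, not the actual `launch_and_time_kernel` code: `clamp_elapsed_ms` is a hypothetical helper, and its argument stands in for the value that `hipEventElapsedTime` writes to its output parameter.

```cpp
#include <algorithm>

// Hypothetical helper (not a CK function): clamp a measured elapsed time in
// milliseconds so that callers never see a physically impossible negative value.
// `raw_ms` stands in for the result reported by hipEventElapsedTime.
inline float clamp_elapsed_ms(float raw_ms)
{
    return std::max(raw_ms, 0.0f);
}
```

Non-negative measurements pass through unchanged, so only the aliasing case described in the commit message is affected.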
Aviral Goel	c7eb33078c	[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 11:17:11 -04:00
Bartłomiej Kocot	dbdf0a6eca	[rocm-libraries] ROCm/rocm-libraries#6090 (commit bd5709e) [CK][CK Tile] Conv Bwd Data flush cache and profiling improvements (#6090) ## Motivation Improve accuracy of conv bwd data perf measurements ## Technical Details - enable flush cache - for grouped conv we zero conv input (gemm output) inside the device op, so this is also included in the time measurement - for non-grouped conv we zero conv input (gemm output) outside the device op (in profile_conv_bwd_data_impl.hpp), so it is not included. - In this PR I changed it to include zeroing if time_kernel/flush cache is enabled, so the comparison should now be fairer. I changed it only for time_kernel/flush_cache because MIOpen runs its own zeroing for non-grouped solvers. ## Test Plan test_grouped_conv_bwd_data_* ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-04 00:22:22 +00:00
harkgill-amd	2b02deb36c	[rocm-libraries] ROCm/rocm-libraries#5141 (commit e790cc0) Add missing gfx1033 to gfx103 group definition in ck (#5141) ## Motivation Resolving PyTorch build failures when enabling builds for gfx103X-all family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033` is the only failing architecture in the family and the failures point to missing support in CK. ## Technical Details PyTorch build fails with repeated error message ``` /__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD' 33 \| wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` `gfx1033` is missing from the `__gfx103__` group which results in `CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. This change adds `gfx1033` to the files where it is missing, which should be the minimum fix to allow torch builds to pass. ## Test Plan Compile sample test file and target gfx1033 ``` ... #ifdef __HIP_DEVICE_COMPILE__ static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value"); #else static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value"); #endif ``` ## Test Result Prior to applying the patch, compilation fails with `error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`. After applying the patch, the test file compiles successfully. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-04-03 13:44:38 -06:00
Estevan Vedovelli	2303d0aee7	[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a) [CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022) ## Motivation Consumers that identify kernels by their `GetTypeString()` (such as hipTensor's actor-critic kernel selection, which hashes the string into a stable cross-platform UID) were silently dropping one of two colliding variants during registry insertion. `GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle` previously printed 13 template parameters, omitting `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. These four parameters determine the block-transfer access width and LDS padding strategy, and are precisely what differentiates the `kk`, `kn`, `mk`, and `mn` layout variants from one another when all other geometry parameters are equal. Two instantiations with identical 13-parameter strings are distinct C++ types that accept different stride layouts and reject each other's arguments via `IsSupportedArgument`. This patch extends the output to 17 parameters so that every distinct template instantiation of this class produces a unique `GetTypeString()`. ## Technical Details `include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`: - extend `GetTypeString()` from 13 to 17 parameters including `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. ## Test Plan Build CK and hipTensor with these changes, and verify hipTensor can differentiate and select the correct kernels with layout variations. ## Test Result CK is building correctly and hipTensor is selecting the kernels correctly. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 08:18:11 -07:00
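The collision described in #6022 can be illustrated with a reduced sketch. Everything here is hypothetical: `ContractionParamsSketch` keeps only one geometry parameter and two of the layout-differentiating ones, and the two string functions are stand-ins for a short and an extended `GetTypeString()`, not the real 13- and 17-parameter output.

```cpp
#include <string>

// Hypothetical, heavily reduced parameter set: one geometry parameter plus
// two of the layout-differentiating parameters named in the commit message.
struct ContractionParamsSketch
{
    int block_size;
    int a_src_scalar_per_vector;
    int a_lds_extra_m;
};

// Short form: omits the layout-differentiating parameters, so two variants
// that share the same geometry produce the same string.
inline std::string type_string_short(const ContractionParamsSketch& p)
{
    return "Op<" + std::to_string(p.block_size) + ">";
}

// Extended form: prints every parameter that distinguishes two instantiations.
inline std::string type_string_full(const ContractionParamsSketch& p)
{
    return "Op<" + std::to_string(p.block_size) + ", " +
           std::to_string(p.a_src_scalar_per_vector) + ", " +
           std::to_string(p.a_lds_extra_m) + ">";
}
```

A consumer that hashes the short string would map both variants to one key, which is the silent drop the commit describes; the extended string keeps them apart.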
Illia Silin	70e4696f01	[rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b) [CK] fix clang lifetimebound errors with staging compiler (#5921) ## Motivation The ROCm staging compiler (newer Clang) enforces `[[clang::lifetimebound]]` annotations on methods that return references or pointers to internal object data. Without these annotations, the staging compiler emits compilation errors for container accessor methods across the CK and CK Tile namespaces. ## Technical Details Adds `[[clang::lifetimebound]]` to all reference/pointer-returning accessors in core container types: `ck::` namespace: - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()` - `index_array` -- `operator[]` - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()` - `IndexLookupTable` -- `operator[]` `ck_tile::` namespace: - `array` -- `get(i)`, `at()`, `operator[]`, `operator()` - `static_array` -- `operator[]` - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()` - `make_kernel()` -- parameter pack Also removes the unused `instance_index` variable from `batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing accordingly. ## Test Plan - Compile with the staging compiler to verify all lifetimebound errors are resolved - Existing tests pass unchanged -- the attribute is a compile-time annotation with no runtime effect ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 07:19:32 -07:00
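For readers unfamiliar with the attribute in #5921, a minimal sketch of an annotated accessor follows. `SmallArraySketch` and the `SKETCH_LIFETIMEBOUND` macro are hypothetical and are not CK's `Array`; the macro only exists so the example also compiles on compilers that do not know the attribute.

```cpp
#include <cstddef>

// Expand to [[clang::lifetimebound]] where the compiler supports it,
// and to nothing elsewhere. The macro name is an assumption of this sketch.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::lifetimebound)
#define SKETCH_LIFETIMEBOUND [[clang::lifetimebound]]
#endif
#endif
#ifndef SKETCH_LIFETIMEBOUND
#define SKETCH_LIFETIMEBOUND
#endif

// Hypothetical container: the returned reference points into *this, so the
// attribute is written after the member function's parameter list, where it
// applies to the implicit object parameter.
template <typename T, std::size_t N>
struct SmallArraySketch
{
    T data[N];

    constexpr T& operator[](std::size_t i) SKETCH_LIFETIMEBOUND { return data[i]; }
    constexpr const T& operator[](std::size_t i) const SKETCH_LIFETIMEBOUND
    {
        return data[i];
    }
};
```

As the commit notes, the attribute is a compile-time annotation only; the accessor behaves the same at runtime with or without it.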
Jan Patrick Lehr	2457ee6395	[rocm-libraries] ROCm/rocm-libraries#5639 (commit a65e645) [CK] More lifetime-warning suppression (#5639) ## Motivation The staging compiler picked up another change from upstream that leads to more lifetime-analysis warnings. This breaks the build, given CK is built with -Werror. As a result, compiler promotion is blocked. ## Technical Details This patch adds the pragma push diagnostics to ignore the lifetime-warnings in the modified files to unblock compiler promotion. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-28 11:19:46 +00:00
Bartłomiej Kocot	4411c11a13	[rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1) [CK] Fix min k_batch calculation in conv kernels (#5785) ## Motivation Avoid division by 0 and remove the unneeded "-1". ## Technical Details Our div-up implementation already returns the exact quotient when the input is divisible, so there is no need to subtract 1. ## Test Plan test_grouped_conv_bwd_weight ## Test Result Passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1019	2026-03-27 15:37:37 +00:00
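The point about the "-1" in #5785 follows from how ceiling division behaves on divisible inputs. The sketch below uses a hypothetical `div_up_sketch`, assumed to behave like the div-up helper the commit refers to; it is not copied from CK, and it does not show the division-by-zero guard, whose form the commit message does not specify.

```cpp
#include <cstdint>

// Hypothetical ceiling division for a >= 0 and b > 0.
// When a is divisible by b the result is exactly a / b, so callers do not
// need to subtract 1 afterwards.
constexpr std::int64_t div_up_sketch(std::int64_t a, std::int64_t b)
{
    return (a + b - 1) / b;
}

static_assert(div_up_sketch(8, 4) == 2, "divisible input gives the exact quotient");
static_assert(div_up_sketch(9, 4) == 3, "non-divisible input rounds up");
```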
lalala-sh	32da20e02c	[rocm-libraries] ROCm/rocm-libraries#5086 (commit f4880d7) [CK] Fix MOE FP8 SplitK buffer descriptor OOB (#5086) When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces are still based on full K dimension. This causes hardware buffer resource descriptors to extend beyond the actual tensor allocation, leading to GPU memory access faults when the tensor happens to be placed at the end of an allocated memory pool region. Fix by subtracting the split offset from each buffer's element space in both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer descriptor range [shifted_base, shifted_base + reduced_space) exactly covers the valid allocation. Also refactor SplitKBatchOffset to accept const Problem& (instead of Argument&) and add a default constructor, enabling direct reuse in Run/Run_2Lds without duplicating offset calculation logic. Made-with: Cursor ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Yi DING <yi.ding@amd.com>	2026-03-19 10:41:53 +08:00
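The range arithmetic behind #5086 can be sketched as follows. The struct and function names are hypothetical and the values in any usage are illustrative; the sketch only models which element range a descriptor covers relative to the start of the tensor allocation, not the actual `make_dynamic_buffer` call.

```cpp
#include <cstddef>

// Half-open element range [begin, end) that a buffer descriptor can address,
// measured from the start of the tensor allocation. Hypothetical type.
struct DescriptorRangeSketch
{
    std::size_t begin;
    std::size_t end;
};

// Before the fix: the base is shifted by the split offset, but the element
// space is still the full size, so the range runs past the allocation.
constexpr DescriptorRangeSketch range_before_fix(std::size_t full_space,
                                                 std::size_t split_offset)
{
    return {split_offset, split_offset + full_space};
}

// After the fix: the split offset is subtracted from the element space, so
// the range ends exactly at the end of the allocation.
constexpr DescriptorRangeSketch range_after_fix(std::size_t full_space,
                                                std::size_t split_offset)
{
    return {split_offset, split_offset + (full_space - split_offset)};
}
```

With a full element space of 1024 and a split offset of 256, the unfixed range ends at 1280 while the fixed range ends at 1024.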
Christopher Millette	d48567a076	[rocm-libraries] ROCm/rocm-libraries#5031 (commit 1d86a92) [CK] Replace nested static_for with static_ford to reduce device IR function emissions [1B] (#5031) ## Summary ### Rationale CK's GPU kernels are among the slowest files in the ROCm build, with a single translation unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified nested `static_for` loops as the root cause: each nesting level multiplies the number of unique lambda IR functions the compiler must process. A 2-level nest of `static_for<0, M, 1>` / `static_for<0, N, 1>` produces M×N unique lambda types. With typical GEMM dimensions (M=16, N=4), a single nest generates 64 unique functions — and these nests appear hundreds of times across the codebase. The LLVM backend's CGSCC (Call Graph Strongly Connected Components) framework processes each function independently, so reducing function count directly reduces backend time. ### What changed 393 nested compile-time loop patterns across 73 files are converted to `static_ford`, which flattens multi-dimensional compile-time iteration into a single `static_for` with index decomposition. This eliminates 994 `static_for` nesting levels (42% reduction). 
Three pattern categories were converted: - Category A: `static_for` wrapping `static_ford` — fold outer dimension into ford - Category B: nested `static_ford` — merge into single higher-dimensional ford - Category C: nested `static_for` chains — convert to single `static_ford` ### Verification ASM equivalence: PASS — 51/51 device assembly files identical (gfx942 + gfx1100) \| Architecture \| Files compared \| Largest file \| Result \| \|---\|---\|---\|---\| \| gfx942 \| 36 \| 386,685 lines \| ALL MATCH \| \| gfx1100 \| 15 \| 47,769 lines \| ALL MATCH \| Build time (Wilcoxon signed-rank test, 7 paired trials): \| Target \| Pre (s) \| Post (s) \| Delta \| p-value \| \|---\|---\|---\|---\|---\| \| bscale \| 169 \| 152 \| -9.8% \| 0.016 \* \| \| xdl_v1234 \| 207 \| 194 \| -6.6% \| 0.016 \* \| \| preshuffle \| 275 \| 264 \| -3.9% \| 0.016 \* \| \| xdl_base \| 142 \| 137 \| -3.2% \| 0.031 \* \| IR function counts (device backend, gfx942): \| Target \| InstFunc Δ \| CodeGen Δ \| Compiler Δ \| \|---\|---\|---\|---\| \| bscale \| -13,043 (-8.2%) \| -2,103 (-3.5%) \| -10.7% \| \| xdl_v1234 \| -9,431 (-5.7%) \| +59 (+0.1%) \| -5.2% \| \| xdl_base \| -6,162 (-4.9%) \| -1,141 (-2.5%) \| -2.2% \| \| xdl_old \| -3,234 (-3.7%) \| -963 (-8.7%) \| -3.3% \| ### Value - 994 fewer `static_for` nesting levels (-42%) across 73 files - 393 `static_ford` sites created (from 4 pre-existing) - Up to 9.8% compile-time reduction on representative targets (statistically significant, p < 0.05) - Up to 13K fewer IR function instantiations per translation unit - Net -849 LOC from reduced indentation - Zero ASM changes — identical device code output verified on gfx942 and gfx1100 - All scheduling barriers, `if constexpr` guards, and MFMA/WMMA accumulation order preserved ### Files changed (73) - `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle, blockscale variants) - `grid/`: 20 files (softmax, normalization, reduction, attention, layernorm) - `thread/`: 5 files (tensor slice 
transfer, contraction, GEMM dlops, reduction) - `tensor_description/`: 1 file (tensor_adaptor) ## Test plan - [x] `static_ford` tested with 21 unit tests in `test/util/unit_ford.cpp` (1D-4D, custom orders, compile-time verification) - [x] All conversions preserve iteration order, `block_sync_lds()` placement, `if constexpr` scheduling guards, and MFMA/WMMA accumulation order - [x] ASM equivalence verified: 51 device `.s` files across gfx942 + gfx1100 - [x] Build-time improvement statistically confirmed (Wilcoxon, p < 0.05, 4 targets) - [x] IR function count reduction confirmed via `-ftime-trace` on 7 targets - [x] Detection script reports 0 remaining safe patterns (180 blocked with structural reasons) - [x] Existing CI tests (GEMM, softmax, normalization, batch norm, reduction, attention) exercise all converted code paths ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 08:45:22 -06:00
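The index decomposition that #5031 attributes to `static_ford` can be sketched in standard C++17. This is not CK's `static_ford` implementation: `flat_for` and `visit_order_2x3` are hypothetical names, and the sketch only shows how one flat compile-time loop over M*N indices recovers the (i, j) pair that a two-level nest would produce.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// One fold over M*N flat indices; each flat index is decomposed into a
// row-major (i, j) pair and passed to f as compile-time constants.
template <std::size_t M, std::size_t N, typename F, std::size_t... Flat>
void flat_for_impl(F&& f, std::index_sequence<Flat...>)
{
    (f(std::integral_constant<std::size_t, Flat / N>{},
       std::integral_constant<std::size_t, Flat % N>{}),
     ...);
}

// Hypothetical flattened replacement for a 2-level compile-time loop nest.
template <std::size_t M, std::size_t N, typename F>
void flat_for(F&& f)
{
    flat_for_impl<M, N>(f, std::make_index_sequence<M * N>{});
}

// Records the visiting order for a 2 x 3 iteration space.
inline std::vector<std::pair<std::size_t, std::size_t>> visit_order_2x3()
{
    std::vector<std::pair<std::size_t, std::size_t>> out;
    flat_for<2, 3>([&](auto i, auto j) { out.emplace_back(i(), j()); });
    return out;
}
```

The comma fold evaluates left to right, so the visiting order matches the nested form: the outer index changes slowest.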
Bartłomiej Kocot	b61cf917e3	[rocm-libraries] ROCm/rocm-libraries#5454 (commit 8dade31) [CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454) ## Motivation Flush cache to get more stable results during profiling old ck and ck tile. ## Technical Details Flush cache before each kernel call and one more first run. ## Test Plan test_grouped_conv_bwd_weight_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-966 --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 17:46:21 +00:00
lalala-sh	7b61c8a0b6	[rocm-libraries] ROCm/rocm-libraries#5225 (commit 880166b) [CK] fix moe memset size which is bigger than alloc (#5225) ## Motivation Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that crashes split-K MOE GEMM with "HIP runtime error: invalid argument". When KBatch > 1, the invoker zeroes the output buffer using arg.M * arg.N as the byte count. However, arg.M is the padded sorted-token-id length from MOE routing, which can be much larger than the actual output allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write beyond the buffer, and the silently-swallowed HIP error propagates to the subsequent kernel launch via hipGetLastError(). This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset matches the actual output size. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 17:30:07 +08:00
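The size mismatch in #5225 comes down to which row count is multiplied by N. The two functions below are hypothetical and count elements only (the element size is left out because the commit message does not state it); the numbers used with them are illustrative.

```cpp
#include <cstddef>

// Size used before the fix: the padded sorted-token-id length times N.
constexpr std::size_t memset_size_before(std::size_t padded_m, std::size_t n)
{
    return padded_m * n;
}

// Size used after the fix: NumTokens * TopK * N, matching the output allocation.
constexpr std::size_t memset_size_after(std::size_t num_tokens,
                                        std::size_t top_k,
                                        std::size_t n)
{
    return num_tokens * top_k * n;
}
```

For example, with 4 tokens, TopK of 2, N of 128 and a padded length of 64, the old size is 8192 while the allocation only holds 1024.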
Bartłomiej Kocot	3741885b52	[rocm-libraries] ROCm/rocm-libraries#5114 (commit 59b8cb5) [CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114) ## Motivation Improve profiling for grouped convolution forward for better comparison between CK and CK Tile ## Technical Details - Include preprocessing time for ck tile - Add flush cache for conv fwd profiler - Switch configs to builder reflect - Add KPerXdl deduce - Add non-grouped ported instances ## Test Plan test_grouped_convnd_fwd_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-786	2026-03-11 23:38:15 +01:00
JP-Fernando	96948004ba	[rocm-libraries] ROCm/rocm-libraries#4421 (commit 5bb5769) [CK] Unify the grouped convolution gridwise Run() functions (#4421) ## Motivation There are currently three different grouped convolution related Run() function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`. These are used for the different types of grouped convolution: Forward, Backward weights, and Backward data. The functions are very similar and should be unified to a single `Run()` function for all types of grouped convolution. ## Technical Details The three old `Run<>()` functions were replaced with a single unified function. The new `Run<>()` function is run from device implementations: - DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 - DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeight_Wmma_CShuffleV3 The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor implementation uses a different `Run<>()` overload and was therefore not modified. ## Test Plan Run the following grouped convolution tests on `gfx1201`, as this architecture is WMMA-capable: - `test_grouped_convnd_fwd` - `test_grouped_convnd_bwd_weight` - `test_grouped_convnd_bwd_data` Compilation and testing were also executed on `gfx1100` to avoid CI problems. ## Test Result First part (unification of `Run<>()` function): All tests successful. Second part (integration of single `Run<>()` function as a direct call): All tests successful. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-03-11 17:38:55 +01:00
John Shumway	2da1f39a9e	[rocm-libraries] ROCm/rocm-libraries#5284 (commit 76b5b15) [CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284) Add factory, InstanceTraits, and conv traits support for the WMMA V3 forward convolution kernel, enabling the CK Builder to generate and dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs. ## Motivation As reported in issue #4944, MIOpen includes WMMA V3 forward convolution kernels, so this PR adds support for those kernels similarly to other supported kernels. ## Technical Details This follows the same implementation as the other kernels. I added some support for reflection, but I left a few todos since we need to generalize our convolution traits to generalize across WMMA/MFMA and CK/CKTile. ## Test Plan Added faster tests to `ninja smoke-builder` that check the instance-traits logic, and I added longer tests that instantiate kernels, following the existing pattern in other kernels. ## Test Result I tested all code with `ninja check-builder` on a gfx1101 build and ran on gfx1101. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 16:41:51 -07:00
Max Podkorytov	94d5eb4f13	[rocm-libraries] ROCm/rocm-libraries#5041 (commit 481aecc) [CK] Precompute SpaceFillingCurve indices to reduce compile time by 31% (#5041) ## Summary Optimize `SpaceFillingCurve` in CK to reduce compile time by precomputing all index values into a static constexpr lookup table. ### Problem - `GetIndex<N>` was instantiated separately for every index value (0 to NumAccesses-1) - Each instantiation triggered nested `static_for` loops with O(N²) template depth - This caused 34,000+ template instantiations taking 69 seconds in frontend ### Solution - Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed indices - Add `compute_single_index()` helper using O(N) `static_for` loops - Add `compute_all_indices()` to build entire table in one constexpr evaluation - `GetIndex<N>` becomes simple array lookup: `return index_table[N]` ### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp) \| Metric \| Before \| After \| Improvement \| \|--------\|--------\|-------\|-------------\| \| Total compile time \| 120.4s \| 83.6s \| -31% \| \| Frontend time \| 88.7s \| 52.6s \| -41% \| \| GetIndex instantiations \| 34,176 \| 384 \| -99% \| \| GetIndex time \| 69.0s \| 0.11s \| -99.8% \| \| SpaceFillingCurve time \| 75.7s \| 4.3s \| -94% \| ## Test plan - [x] Builds successfully with `-Werror -Weverything` - [ ] Run existing unit tests - [ ] Verify numerical correctness on sample kernels 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2026-03-10 12:40:08 -07:00
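The lookup-table idea in #5041 can be sketched with a much simpler curve. The code below is not CK's `SpaceFillingCurve`: `SnakeCurveSketch` and `make_snake_table` are hypothetical, and a 2D serpentine traversal stands in for the general curve. What it shows is the structure the commit describes: all indices are computed in one constexpr evaluation, and `GetIndex<N>` becomes an array lookup.

```cpp
#include <array>
#include <cstddef>

using Index2D = std::array<std::size_t, 2>;

// Build the whole table in a single constexpr evaluation with ordinary loops.
// Odd rows run right-to-left (serpentine order). Hypothetical helper.
template <std::size_t Rows, std::size_t Cols>
constexpr std::array<Index2D, Rows * Cols> make_snake_table()
{
    std::array<Index2D, Rows * Cols> table{};
    for(std::size_t n = 0; n < Rows * Cols; ++n)
    {
        const std::size_t r = n / Cols;
        const std::size_t c = n % Cols;
        table[n] = Index2D{r, (r % 2 == 0) ? c : Cols - 1 - c};
    }
    return table;
}

template <std::size_t Rows, std::size_t Cols>
struct SnakeCurveSketch
{
    static constexpr std::size_t NumAccesses = Rows * Cols;

    // Computed once per (Rows, Cols), shared by every GetIndex<N>.
    static constexpr std::array<Index2D, NumAccesses> index_table =
        make_snake_table<Rows, Cols>();

    // Each GetIndex<N> instantiation is now just a lookup.
    template <std::size_t N>
    static constexpr Index2D GetIndex()
    {
        static_assert(N < NumAccesses, "access id out of range");
        return index_table[N];
    }
};
```

The sketch requires C++17 (constexpr `std::array` element access and inline static data members).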
Márton Bidlek	dfa680345f	[rocm-libraries] ROCm/rocm-libraries#5135 (commit 5ccc138) Proof of concept for removing forward declarations (#5135) ## Motivation Currently, we forward declare CK device operation templates in CK-Builder's reflection code: `9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)` This is mainly required to break a circular dependency in reflection. The architecture of that is as follows: MyDeviceOp implements GetInstanceString(). This is typically defined directly in the class definition (no forward declaration). GetInstanceString() calls instance_string<MyDeviceOp>(). instance_string<MyDeviceOp>() calls InstanceTraits<MyDeviceOp>::instance_string(). InstanceTraits has a specialization for MyDeviceOp which implements instance_string(). So in order for GetInstanceString() to work properly, InstanceTraits must already be defined. And for InstanceTraits to be defined, the device op needs to be defined. In order to do that, we are currently using the aforementioned forward declaration. ## Technical Details C++'s lazy template evaluation is used by calling into an as-yet undefined static member function of `InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then specializing `InstanceTraits` only _after that_. The caveat here is that both the device op itself as well as the instance traits specialization must be in scope, otherwise there would be an undefined function error. In practice, we can solve that either by placing the instance traits directly into the file that defines `MyDeviceOp`, or possibly by using a `.inc` file to keep the concerns separated. ## Test Plan The results were verified by running the existing regression tests for CK Builder ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. 
--------- Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>	2026-03-09 09:34:18 -07:00
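The lazy-evaluation trick described in #5135 can be sketched in a few lines. This is a simplified illustration, not the CK Builder code: `MyDeviceOp` here is a hypothetical one-parameter class template, the intermediate `instance_string<MyDeviceOp>()` hop is omitted, and the returned string format is invented for the example.

```cpp
#include <string>

// Primary template: declared only. No definition is needed yet.
template <typename Op>
struct InstanceTraits;

// The device op is defined first, without any forward declaration.
template <int BlockSize>
struct MyDeviceOp
{
    std::string GetInstanceString() const
    {
        // Dependent call: resolved only when this member is instantiated,
        // by which point the specialization below is visible.
        return InstanceTraits<MyDeviceOp>::instance_string();
    }
};

// Specialization placed after the device op definition.
template <int BlockSize>
struct InstanceTraits<MyDeviceOp<BlockSize>>
{
    static std::string instance_string()
    {
        return "MyDeviceOp<" + std::to_string(BlockSize) + ">";
    }
};
```

As the commit's caveat says, this only works when both the device op and the traits specialization are in scope at the point where `GetInstanceString()` is first instantiated.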
Christopher Millette	279e6d0794	[rocm-libraries] ROCm/rocm-libraries#4673 (commit ec385da) Compile-time optimize threadwise slice transfer (#4673) ## Motivation Profiling with `-ftime-trace` on representative translation units (e.g., `device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`) revealed that 92% of frontend time was spent in template instantiation. The primary bottleneck was redundant instantiation of identical helper logic across multiple threadwise transfer class variants. Each `ThreadwiseTensorSliceTransfer_v` class independently contained its own copy of the same helper computations — serpentine traversal, coordinate stepping, thread scratch descriptors, lambda-like functors, and compile-time constants — duplicated across 13 header files. When a typical GEMM or convolution kernel TU includes blockwise operations (e.g., `blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants simultaneously, causing the compiler to instantiate the same helper logic multiple times with the same template arguments. This was compounded by the helpers being defined as members of the outer `ThreadwiseTensorSliceTransfer_v` classes, which carry 14+ template parameters. Functions like `ComputeForwardSweep` depend only on their two argument types, but as inline members of the outer class, the compiler was forced to create separate instantiations for every unique combination of all outer parameters (data types, descriptors, vector widths, etc.) — even when most of those parameters had no effect on the helper's output. 
## Technical Details ### The Fix: Shared Helper Struct Hierarchy Duplicated logic was extracted into a standalone helper hierarchy in `threadwise_tensor_slice_transfer_util.hpp`: ``` ThreadwiseTransferHelper_Base (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor, \| ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple) +-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex, \| ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable) +-- ThreadwiseTransferHelper_SFC (ComputeSFCCoordinateResetStep) ``` Each helper method is now parameterized only by what it actually uses: - `ComputeForwardSweep(idx, lengths)` — parameterized only by the two argument types, not by `SrcData`, `DstData`, `SrcDesc`, etc. - `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by the descriptor and access sequence types. - `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector, DimAccessOrder>()` — parameterized only by the four values it actually needs. This reduces template instantiation work in two ways: 1. Across different transfer variants (v3r1 vs v3r2 vs v3r1_gather): the compiler reuses a single instantiation instead of creating one per variant. 2. Across different outer class instantiations (fp16 vs bf16 vs int8): the compiler reuses the helper instantiation because the helper doesn't depend on the data type at all. 
### Refactored Headers

13 headers now delegate to the shared helpers instead of duplicating logic:

- Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant
- SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter
- Dead code removed: v4r1, v5r1

### Additional Fixes Found During Refactoring

- Two latent bugs in v3r2 (`forward_sweep` indexing, `GetDstCoordinateResetStep` extraction)
- Dead `SrcCoordStep` variables in v4r1 and v5r1
- Unused `scale_element_op_` member in v3r1_dequant (restored with note)

### Net Code Change

+1,428 / -2,297 lines (~870 lines removed).

## Test Plan

### Unit Tests

28 host-side gtests in `test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp` covering the full helper hierarchy:

| Suite | Tests | What is verified |
|-------|-------|------------------|
| ThreadwiseTransferHelperBase | 6 | Compile-time constants, inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false in 2D and 3D |
| ThreadwiseTransferHelperSerpentine | 9 | `ComputeForwardSweep` (even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete), `ComputeDataIndex`, `ComputeCoordinateResetStep`, `VectorSizeLookupTable`, `VectorOffsetsLookupTable` |
| ThreadwiseTransferHelperSFC | 6 | `ComputeSFCCoordinateResetStep`: single access, 2D row-major, 2D column-major, 3D batch, even/odd inner access counts |
| ThreadwiseTransferHelperInheritance | 3 | Serpentine and SFC derive from Base, are not related to each other |
| DetailFunctors | 4 | `lambda_scalar_per_access`, `lambda_scalar_step_in_vector`, `lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) |

### Semantic Equivalence

GPU ISA comparison using `--cuda-device-only -S` confirmed identical assembly output (modulo `__hip_cuid_` metadata) between baseline and refactored code.

## Test Results

All measurements on a 384-core machine, `-j64`, freshly rebooted, near-idle.
### Targeted Builds (affected targets only)

| Target | Baseline | Refactored | Wall-clock Delta | CPU Delta |
|--------|----------|------------|------------------|-----------|
| `device_grouped_conv2d_fwd_instance` (160 TUs) | 7m 37s / 189m CPU | 6m 53s / 161m CPU | -9.7%* | -14.9% |
| `device_grouped_conv3d_fwd_instance` (185 TUs) | 9m 49s / 202m CPU | 6m 42s / 182m CPU | -31.8% | -10.0% |
| Combined | 17m 27s / 392m CPU | 13m 35s / 344m CPU | -22.2% | -12.4% |

### Full Project Build (8,243 targets)

| Metric | Baseline | Refactored | Delta |
|--------|----------|------------|-------|
| Wall-clock | 103m 38s | 111m 56s | +8.0%* |
| CPU time | 4705m 7s | 4648m 17s | -1.2% |

\*Wall-clock inflated by external load spike during refactored build (load 90 vs 66). CPU time is the reliable metric.

### Context

~15% of all build targets (1,262 / 8,243) transitively include the modified headers. These are primarily GEMM and convolution kernel instantiations, the core compute workloads. The 12-15% CPU savings on affected targets is diluted to 1.2% across the full project because 85% of targets are unaffected.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 09:26:40 -07:00
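The instantiation-sharing argument in the threadwise slice transfer commit above can be illustrated with a small sketch. It is not the CK code: `TransferBefore`, `TransferAfter`, `SharedSerpentineHelper`, and `IsForwardRow` are hypothetical names, and the even/odd sweep rule is a simplified stand-in for the real serpentine logic. The point it shows is only that a helper nested in a class template is a distinct entity per outer instantiation, while a standalone helper is a single shared type.

```cpp
#include <type_traits>

// "Before" (hypothetical): the helper is nested inside the outer class
// template, so every outer instantiation owns its own distinct helper.
template <typename SrcData, typename DstData, int VectorWidth>
struct TransferBefore
{
    struct Helper
    {
        // Simplified stand-in rule: even rows sweep forward, odd rows backward.
        static constexpr bool IsForwardRow(int row) { return row % 2 == 0; }
    };
};

// "After" (hypothetical): the helper is standalone and depends on nothing
// from the outer class, so all outer instantiations share it.
struct SharedSerpentineHelper
{
    static constexpr bool IsForwardRow(int row) { return row % 2 == 0; }
};

template <typename SrcData, typename DstData, int VectorWidth>
struct TransferAfter
{
    using Helper = SharedSerpentineHelper;
};

// Nested helpers differ per outer instantiation...
static_assert(!std::is_same<TransferBefore<float, float, 4>::Helper,
                            TransferBefore<short, short, 8>::Helper>::value,
              "nested helper is duplicated per outer instantiation");

// ...whereas the extracted helper is one type reused everywhere.
static_assert(std::is_same<TransferAfter<float, float, 4>::Helper,
                           TransferAfter<short, short, 8>::Helper>::value,
              "extracted helper is shared across outer instantiations");
```

The same reasoning applies to member function templates: once they live in the shared helper, their instantiations are keyed only on their own template arguments.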
lalala-sh	df806ca2f6	[rocm-libraries] ROCm/rocm-libraries#5094 (commit d4548e6) [CK] use int64 for ptr offset (#5094) ## Motivation When the number of experts (E) is large (e.g., E=257 in DeepSeek-V3), the `expert_id * expert_stride` calculation in MOE GEMM kernels overflows `int32` (`index_t`), causing the weight matrix (B) pointer to wrap to an invalid address and triggering a GPU memory access fault. For example, with `N=1024, K=7168, IsInputGemm=true`: - `expert_stride = N * K * 2 = 14,680,064` - `INT32_MAX / expert_stride ≈ 146` - Any `expert_id >= 147` causes overflow → negative offset → illegal memory access → GPU crash ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: amd-shiraz <shiraz.ali@amd.com>	2026-03-05 18:00:01 -08:00
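The overflow arithmetic in the int64 pointer-offset commit above can be checked with a short sketch. The aliases `index_t` and `long_index_t` here are local stand-ins for 32-bit and 64-bit index types, and `expert_offset` is a hypothetical helper, not the function changed in the PR. The values `N=1024`, `K=7168`, and the factor 2 come from the commit message.

```cpp
#include <cstdint>
#include <limits>

// Local stand-ins for a 32-bit index type and a 64-bit offset type.
using index_t      = std::int32_t;
using long_index_t = std::int64_t;

// Hypothetical helper: widen to 64 bits BEFORE multiplying, so the product
// is never computed in 32-bit arithmetic.
constexpr long_index_t expert_offset(index_t expert_id, index_t expert_stride)
{
    return static_cast<long_index_t>(expert_id) * expert_stride;
}

// Values from the commit message: N=1024, K=7168, IsInputGemm=true.
constexpr index_t N             = 1024;
constexpr index_t K             = 7168;
constexpr index_t expert_stride = N * K * 2; // 14,680,064, still fits in int32

constexpr long_index_t int32_max = std::numeric_limits<index_t>::max();

static_assert(expert_stride == 14680064, "stride matches the commit message");
static_assert(int32_max / expert_stride == 146, "last safe expert_id is 146");
static_assert(expert_offset(146, expert_stride) <= int32_max,
              "expert_id 146 still fits in int32");
static_assert(expert_offset(147, expert_stride) > int32_max,
              "expert_id 147 needs more than 32 bits");
```

With a 64-bit offset, `expert_id = 256` (the last of 257 experts) yields 3,758,096,384, which is representable; computing the same product in `int32` would overflow.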
Max Podkorytov	41070044bd	[rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb) Add generate_identity_sequences helper and replace lambdas with named functors (#4828) ## Summary - Add `generate_identity_sequences<N>()` helper that returns `Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>` - Replace lambdas with named functors in `transform_tensor_descriptor` - Add `unpack_and_merge_sequences` helper functor - Reduces `transform_tensor_descriptor` instantiations from 388 to 32 (92% reduction) ## Motivation Multiple call sites use `generate_tuple([](auto i) { return Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda instantiations. Additionally, each lambda in `transform_tensor_descriptor` creates a unique closure type, causing the function to be instantiated separately for every call site. Named functors share a single type, so the compiler reuses the same instantiation. ## Changes ### Part 1: generate_identity_sequences helper - Replaces common lambda pattern for generating identity sequences - Each lambda expression creates a unique closure type, causing separate template instantiations at every call site - Named helper shares a single type across all uses ### Part 2: Named functors in transform_tensor_descriptor - Add `unpack_and_merge_sequences` helper to replace lambda in `GetNumOfHiddenDimension` - Use `generate_identity_sequences` in `matrix_padder.hpp` ## Test Plan - [x] Added 7 unit tests: - 4 tests for `generate_identity_sequences` - 3 tests for `unpack_and_merge_sequences` - [ ] Waiting for full CI ## Related PRs This PR merges the functionality from: - ROCm/composable_kernel#3588 (generate_identity_sequences helper) - ROCm/composable_kernel#3589 (Named functors in transform_tensor_descriptor) Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times) Note: This PR supersedes #4283, ROCm/composable_kernel#3588 and ROCm/composable_kernel#3589, which can be closed once this is merged. 
--- 🔁 Imported from [ROCm/composable_kernel#3628](https://github.com/ROCm/composable_kernel/pull/3628) 🧑‍💻 Originally authored by @tenpercent Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 12:10:11 -08:00
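The closure-type argument in the commit above can be sketched in standalone C++. This is an illustration under assumptions, not the CK implementation: `Sequence` is a local stand-in, `std::tuple` replaces CK's `Tuple`, and `identity_functor`, `closure_types_differ`, and `functor_types_match` are hypothetical names. Only the name `generate_identity_sequences` and its described return shape come from the commit message.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Local stand-in for a compile-time integer sequence type.
template <int... Is>
struct Sequence
{
};

// Sketch of a named helper returning tuple<Sequence<0>, ..., Sequence<N-1>>.
template <std::size_t... Is>
constexpr auto generate_identity_sequences_impl(std::index_sequence<Is...>)
{
    return std::tuple<Sequence<static_cast<int>(Is)>...>{};
}

template <std::size_t N>
constexpr auto generate_identity_sequences()
{
    return generate_identity_sequences_impl(std::make_index_sequence<N>{});
}

static_assert(std::is_same<decltype(generate_identity_sequences<3>()),
                           std::tuple<Sequence<0>, Sequence<1>, Sequence<2>>>::value,
              "helper yields one single-element Sequence per index");

// Hypothetical named functor: one type, no matter how many call sites use it.
struct identity_functor
{
    template <typename T>
    constexpr T operator()(T x) const
    {
        return x;
    }
};

// Two textually identical lambdas still have two distinct closure types, so a
// template taking them as arguments is instantiated once per lambda.
inline bool closure_types_differ()
{
    auto a = [](auto x) { return x; };
    auto b = [](auto x) { return x; };
    return !std::is_same<decltype(a), decltype(b)>::value;
}

// Two uses of the named functor share a single type, so a template taking it
// as an argument is instantiated only once.
inline bool functor_types_match()
{
    identity_functor a{};
    identity_functor b{};
    return std::is_same<decltype(a), decltype(b)>::value;
}
```

Both functions return `true`, which is the property the commit relies on when it replaces per-call-site lambdas with named functors.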
Bartłomiej Kocot	a67aaa1b96	[rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2) [CK] Port non-grouped convolution instances to the grouped kernels (#4875) ## Motivation Port non-grouped convolution instances to the grouped kernels in order to deprecate the older non-grouped implementations. ## Technical Details Add the same instances as the non-grouped ones, but using the grouped kernel. ## Test Plan test_grouped_convnd_fwd ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-724	2026-02-28 01:24:30 +00:00
kabrahamAMD	4cd4f0adee	[rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d) [CK_Builder] added bwd data kernels to builder factory (#4582) This PR adds bwd data wmma and xdl kernels to the ck builder, their instance and conv traits as well as tests for the above. --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> Co-authored-by: John Shumway <jshumway@amd.com>	2026-02-27 03:05:38 +00:00
Yung-sheng Tu	743552b6fd	[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f) Implement device_grouped_gemm_fixed_nk_bias for RDNA4 (#4340) ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-26 00:28:09 +00:00
Zoltán Lakatos	cb60fdd58d	[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f) [CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425) ## Motivation Add support for grouped gemm multi ABD fixed NK. ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both the new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. Because this feature previously had no tests for xdl, our assumption is that either there is an implementation bug or these instances were not set up correctly. This may warrant a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures. ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-25 05:16:07 +00:00
Illia Silin	c7c5a018ed	[rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5) Revert "[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3" (#4762) Reverts ROCm/rocm-libraries#4638. Unfortunately, this PR interfered with PR #4299 and caused build errors for gfx11:
In file included from /rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7:
In file included from /rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11:
/rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21: error: no matching function for call to 'CheckValidity'
  553 |                 if(!GridwiseGemm::CheckValidity(
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~	2026-02-20 22:40:28 +00:00


883 Commits