[CK] Fix gfx950 AITER Sync Regressions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
Fixes three gfx950 regressions in the AITER downstream CI that surfaced
after the internal/gfx1250 re-sync (ROCm/rocm-libraries#6978):
> **Companion aiter PR:** ROCm/aiter#3392 — host-side adaptations
(`Kernel::BlockSize()` `constexpr` drops, blockscale `KBatch=1` clamp)
plus the CK submodule bump used to validate these fixes together.
- **FlyDSL MoE AOT cache miss** — the AITER MoE tests run with
`check_aot_cache=True` and fail on any FlyDSL JIT cache miss, but the CI
never pre-compiles the FlyDSL MoE kernels, so gfx950 always misses.
Pre-compile them at the start of the AITER test stage.
- **`buffer.load.lds.v4i32` link error** — ROCm/rocm-libraries#6978
reintroduced a clang-version guard mapping
`llvm.amdgcn.raw.buffer.load.lds` to a `.v4i32`-suffixed name. That name
exists in no LLVM (the rsrc operand is a fixed, non-overloaded `<4 x
i32>`, so the intrinsic is never type-mangled), so gfx950 4-DWORD
direct-to-LDS (e.g. fp4 MoE bpreshuffle) fails to link with `lld:
undefined symbol: llvm.amdgcn.raw.buffer.load.lds.v4i32`. Use the
canonical plain name unconditionally.
- **mixed-precision flatmm warp-GEMM call** — ROCm/rocm-libraries#6978
generalized the scaled `WarpGemmImpl::operator()` from a fixed `<index_t
opselA, index_t opselB>` signature to a variadic `<typename... Params>`
one and updated the `mx_flatmm` pipeline to pass the op-selectors as
`OpSelA<>`/`OpSelB<>` types, but missed the mixed-precision flatmm
pipeline (`F8xMXF4`/`F16xMXF4`), which still passed raw integer
op-selectors. These no longer bind to `typename... Params` (`error: no
matching member function for call to 'operator()'`), breaking
compilation of the fp8/bf16 × fp4 cktile MoE gemm1 instances on gfx950
(aiter `test_moe_2stage`). Wrap the op-selectors in
`OpSelA<>`/`OpSelB<>`.
## Changes
- `Jenkinsfile`: pre-compile the FlyDSL MoE AOT cache (`python3
aiter/aot/flydsl/moe.py`) before the AITER tests.
- `include/ck/utility/amd_buffer_addressing_builtins.hpp` and
`include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp`: drop the
`__clang_major__` guard and always use
`__asm("llvm.amdgcn.raw.buffer.load.lds")`. The plain name is the
canonical one for all sizes including the gfx950 16-byte form, as the
upstream LLVM gfx950 tests confirm.
-
`include/ck_tile/ops/flatmm/pipeline/mixed_prec_flatmm_pipeline_agmem_bgmem_creg_v1.hpp`:
wrap the warp-GEMM op-selectors in `OpSelA<>`/`OpSelB<>` at the five
call sites, matching the `mx_flatmm` pipeline.
## Test plan
Validated via CI.
Replace nested static_for lambdas with compile-time search
helper (#6696)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
- Add `sequence_find_value` and `find_in_tuple_of_sequences`
compile-time search helpers with O(1) template depth
- Replace nested `static_for` lambdas in
`TensorDescriptor::GetTransformAndItsUpperDimension` and
`InitializeElementSize`
- Apply same optimizations to `TensorAdaptor`
Supersedes #4287. Conflict-resolved rebase of
ROCm/composable_kernel#3600 onto current develop.
## Motivation
The `TensorDescriptor` and `TensorAdaptor` classes had excessive
template instantiation from:
1. Nested `static_for` loops with lambdas creating unique closure types
at every call site
2. `generate_tuple` with lambdas causing per-type instantiation overhead
The new helpers use constexpr array lookup and pack expansion instead of
recursive template patterns, achieving O(1) template depth.
## Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved,
`-j1`, `-ftime-trace`)
| TU | Baseline (mean) | New (mean) | Delta | Wilcoxon p | Mann-Whitney
p |
|----|-----------------|------------|-------|-----------|---------------|
| `grouped_conv_fwd_xdl_fp16` (host) | 14,886 ms | 13,353 ms |
**-10.3%** | **0.002** | **0.0002** |
| `grouped_conv_fwd_xdl_fp16` (device) | 27,762 ms | 25,629 ms |
**-7.7%** | **0.002** | **0.0002** |
| **Total (all TUs)** | **57,732 ms** | **54,030 ms** | **-6.4%** | | |
Unrelated TUs (`device_memory`, `host_tensor`, `convolution_parameter`)
show no significant difference (p > 0.3), serving as negative controls.
### Methodology
- 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) on the
same node to eliminate ordering/warmup bias
- Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U
test (unpaired)
- Built with patched clang (LLVM 22) on ctr2-alola-compile-11, `-j1` for
accurate per-TU timing
- Raw data available in Slurm job 275230 results
## Test plan
- [x] 11 unit tests added (5 for `sequence_find_value`, 6 for
`find_in_tuple_of_sequences`)
- [x] Compile-time benchmark with statistical significance (p < 0.01)
- [ ] Full CI
Tracking issue: #4229
[CK] Allow skipping split-K C-buffer zero-init in
xdl_cshuffle blockscale GEMM (#7935)
Add a `skip_zero_init` flag (default false) to the Problem/Argument of
the xdl_cshuffle block-scale GEMM device ops (multiple_d ab_scale and
blockscale b-preshuffle). When the flag is set, the device invoker skips
the internal hipMemsetAsync that zeroes p_c_grid before the KBatch > 1
split-K atomic-accumulation path. The flag is declared on the gridwise
Problem struct (inherited by Argument), so it is visible on both the
rotating-cache (arg_) and the normal (arg) launch paths in each device
op.
Why: callers that already pre-zero the output buffer otherwise pay for a
redundant device-wide memset before split-K atomic accumulation. Gating
the memset behind an opt-in flag lets such callers avoid the duplicate
work. Because the flag defaults to false, every existing call site is
unaffected and the observable behavior is unchanged.
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Cursor <cursoragent@cursor.com>
[CK] Remove IsPackedTensor function
## Motivation
Fix codegen hipRTC
## Technical Details
Remove not needed function. Since MakeArgument supports long_index_t
strides.
## Test Plan
Codegen tests.
## Test Result
Passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Large tensor gemm workaround (#6761)
## Motivation
Customer qeruested large tensor gemm support for 8bit and 4bit data
types. Currently CK triggers “This GEMM not supported” error. The root
cause appears to be the 2 GB limit on the input/output matrix, triggered
by buffer offset constraints when testing a larger shape such as M =
699,904 (which is an exact multiple of MPerBlock = 256).
## Technical Details
Quick workaround to have support ASAP. Split the tensors into inputs /
outputs smaller than 2GB limit. Iterate on host and call all subproblems
without device code change.
Support is restricted to rowise layout in A, Ds and E
All changes were implemented in DeviceGemm structures to avoid secondory
affect on grouped convolutions.
Got lots of AI generated comments. Addressed the ones that seemed
relevant on the functionality.
## Test Plan
Within CK the following examples can be used with modified input sizes:
example_gemm_multiply_multiply_xdl_fp8
example_gemm_mx_fp4
Tested with Aiter tuning on provided shapes.
## Test Result
All gemms run and provide correct results.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
[CK] Fix and expand CK's commit records in version.h (#6900)
## Motivation
In `version.h` of a CK installation `CK_COMMIT_ID` would be empty for
out-of-source builds.
Additionally, if it worked, it would show the parent repo's
(`rocm-libraries`) commit.
## Technical Details
Dropped "required" constraint so "unknown" string becomes a graceful
option.
Changed process of determining the CK commit, now uses
`WORKING_DIRECTORY`.
Thus, `CK_COMMIT_ID` holds only the last CK-relevant commit.
Added `CK_PARENT_COMMIT_ID` which holds the parent's, e.g.
`rocm-libraries`, commit.
This can be the same as `CK_COMMIT_ID`, or not even applicable,
depending on the scenario.
## Test Plan
Ran CMake configuration and installation of CK to verify happy path.
## Test Result
Commit SHA's showed the expected values depending on the repo state.
## Submission Checklist
- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] suppress compiler warnings while building pytorch. (#7760)
## Motivation
Recently added compiler flags that are required to suppress false
warnings by latest staging compiler are not recognized by older compiler
versions and are triggering an avalanche of warnings. Previous attempt
to suppress them by using -Wno-unknown-warning-option flag didn't help,
because that flag wasn't recognized either and just added more warnings.
I've verified that current approach by checking the clang version
actually works as intended and makes the warnings go away.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Grouped Convolution Global Load/Store support (#7631)
## Motivation
Grouped Convolution Global Load/Store support to cover large tensor
cases.
## Technical Details
Utilize global load for grouped convolution forwad kernels. Update
Indexes to use int64.
## Test Plan
- test utils
- test conv kernels in next pr with instances
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1255
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
[CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732)
## Motivation
Fix silent miscompute in the grouped convolution backward-data kernel
(`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride >
dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast
path that dropped all but the first sub-GEMM, producing zeroed slices of
`dx` on
the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness
without giving up the perf gains PR #6208 delivered on stride=1 shapes.
## Technical Details
- Tighten the flat-descriptor fast-path gate to require
`arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its
original purpose). For stride > 1, the implicit GEMM is split into
`gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly;
routing them through the flat path required dropping all but the first,
which was the source of the bug.
- Stride > 1 now falls through to the existing grouped CShuffle path,
which packs all sub-GEMMs into one descriptor array and walks them
on-device in a single kernel launch. This is the pre-PR-6208 production
path; correctness is established and per-dispatch launch count is
minimised.
- Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0)
intersection in
`test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with
`gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this
intersection (all stride>1 cases used G=2; all G=1 cases used stride=1),
which is why PR #6208's regression slipped past CI.
## Test Plan
- `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke
tier — runs on every PR via `smart_build_and_test.sh`).
- End-to-end verify (`verify=1`) via
`example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes
including the original ALMIOPEN-1959 case and a cross-bucket
(`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32`
buckets.
- ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate
via an environment variable: full kernel-family enumeration, winning
kernel + its avg_time reported under each gate. 33/41 shapes completed
before the sweep was stopped; the remaining 8 were the largest
i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape
enumeration budget (not relevant to the verdict).
## Test Result
### Correctness
| Test | Result |
|---|:---:|
| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D,
includes 3 new regression shapes) | **12/12 PASSED** in 14.18 s |
| `test_grouped_convnd_bwd_data_interface` (API checks) | **PASSED** in
0.28 s |
| ALMIOPEN-1959 stride=2 (`verify=1`) | **PASSED** |
| stride=1 K3 (`verify=1`) | **PASSED** |
| stride=3 K3 `gemms_count=9` (`verify=1`) | **PASSED** |
| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) | **PASSED** |
### Performance (ckProfiler A/B on gfx942 / MI300X)
Comparing the **post-fix gate** (flat path only when `gemms_count_==1`,
column "B") vs the **inner-loop variant** that keeps the flat path on
stride>1 (column "A") across 25 stride>1 shapes where production picks
a `_v1` instance (so the gate actually fires):
| Stride | Shapes | A wins | Tie | B wins | Notes |
|:------:|:------:|:------:|:---:|:------:|---|
| 1 (sanity, gate moot) | 3 | 0 | 3 | 0 | gate doesn't differentiate — A
== B as expected |
| > 1 (gate fires) | 25 | **0** | 11 | **14** | B wins +6% to +32%; A
never wins |
Highlights from the firing-gate cases:
| Shape (G=1, stride=2 unless noted) | A ms | B ms | B vs A |
|---|---:|---:|---:|
| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) | 0.183 | 0.171 | **B
+6%** |
| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) | 0.054 | 0.045 | **B +17%**
|
| i2v-010 (N=1, K=C=384, 3×3, 277×209) | 0.174 | 0.125 | **B +28%** |
| Synthetic 50×50 K3 N=32 K=C=256 | 0.131 | 0.088 | **B +32%** |
Why B wins everywhere the gate fires: for `gemms_count = N`, the flat
path needs N kernel launches (one per sub-GEMM), while the grouped path
loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) ×
launch-tax is a structural disadvantage A can't recover from.
### Diff
| File | Lines |
|---|---:|
|
`include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp`
| +14 / −8 (one extra condition + expanded dispatch comment) |
| `test/.../test_grouped_convnd_bwd_data.cpp` | +9 / −0 (3 new shapes) |
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Decouple EpilogueArgs from GridwiseGemm implementation (#4871)
This is duplicate of #4537. I could not re-open it since te target
branch got deleted and could not change the target branch since it was
closed... :)
## Motivation
Right now, all the Epilogues structs are declared inside the base
gridwise struct. They should be independent of it and the specialization
of the selected Epilogue Type should be declared within the the kernel
function.
## Technical Details
All Epilogue structs depend on template parameters that are known to the
base Gridwise Gemm struct. In this PR, we export them to be used
independently by any struct that might need to extract them. This
approach will serve the decoupling purposes for the Epilogues, but also
enable future constructs to use and expand this approach. See
30e2a4c01b64bdea68857c7badd9d7cffbf1adb9.
Right now an issue that arises is that when implementing a new Epilogue
Type, the developer is not forced to decide where this struct should/can
be used or not. To fix this I propose defining an `enum struct
EpilogueType` that will be used to fetch the Epilogue specialization
through a helper struct. See a943ac8d130e12d6843715b322181186e54ba15c.
Note that all the instantiation details will stay in this helper struct.
Also note the static assertion in the else statement.
## Test Plan
Test with existing CI, as nothing is added/removed.
## Test Result
All relevant existing CI tests should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664)
## Motivation
Incorrect results has been introduced for some conv bwd cases.
## Technical Details
This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3.
## Test Plan
CI
## Test Result
Pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
ALMIOPEN-1959
[CK] upgrade CI to rocm7.13 as default compiler (#7612)
## Motivation
Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
[CK] add composable kernel support on gfx1250 (#6978)
## Motivation
Add composable kernel support on gfx1250.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
[CK] Suppress new staging compiler errors (#7384)
## Motivation
This should make new builds with staging compiler pass.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix latest batch of staging compiler warnings (#7111)
## Motivation
Suppress the new batch of clang lifetimebound and invalidation warnings
with the latest staging compiler.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Add missing lds sync (#6655)
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Grouped Convolution Backward Data Direct Load (#6624)
## Proposed changes
Add Grouped Convolution Backward Data with Direct Load into
DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation.
This enables direct global memory loading (bypassing LDS) for the
backward data convolution path on gfx950, following the same pattern
used in both backward weight and forward convolution.
Direct load convolution backward data improves performance by avoiding
LDS round-trips for certain configurations on gfx950, which supports a
wider range of instructions. Currently correctness is checked only at
usage point, but should be extended to a standalone UT in the future.
[CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550)
## Motivation
New changes from upstream llvm-project cause an avalanche of warnings in
CK. Gonna disable them by ignoring the
lifetime-safety-intra-tu-suggestions flag until a better permanent
solution is found.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
projects/composablekernel: add SwigluStep support for MoE blockscale (#6118)
## Summary
- add `swiglustep_and_mul` to the composablekernel MoE blockscale
activation enum
- implement the corresponding blockscale epilogue path for `SwigluStep`
- keep existing `silu` and `gelu` paths unchanged
## Scope
This PR covers the classic composablekernel blockscale MoE path under
`projects/composablekernel`.
This is separate from the `ck_tile` / FlatMM path being discussed in
ROCm/rocm-libraries#5992.
## Motivation
`Step-3.5-Flash-FP8` uses `SwigluStep` in its MoE MLP path. The
dependent AITER change needs native support for this activation in the
classic composablekernel MoE blockscale path.
## Validation
- patch is limited to two composablekernel files under
`projects/composablekernel`
- existing `silu` / `gelu` paths are unchanged
- dependent AITER runtime validation hit the classic CK 2-stage path
with AITER MoE enabled
[CK] Remove code duplications in grouped gemm fixed nk implementations (#4961)
## Motivation
Different flavours of grouped gemm fixed nk implemenations share the
same block to tile mapping logic. Despite that the code responsible for
it is duplicated in each device struct implementation.
- Move `BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops` and
`OffsettedBlockToCTileMapMLoops` from the device struct implementations
to a common header file.
- Use the generic Kernel Argument structures in xdl versions of the
fixed nk.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
CI in general. Relevant test and examples are all fixed_nk versions of
grouped gemm multiple D and ABD.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[MIOpen][CK] Fix bwd weight conv test failures by disabling one block-GEMM V5 instance for 3D convs (#6421)
## Motivation
Due to compiler version update, there are test failures in the test
target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current develop branch, the test
failures are blocking any PR merges into develop. An example of a failed
CI runs is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as the tests are passing for clang compiler version 20.0 and
failing for clang compiler version 22.0.
First attempt to fix this problem had to be reverted in #6400 because it
broke MIOpen internal DB sync tests.
## Technical Details
The root cause for the test failures are the block-GEMM V5 instances of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have large tile size.
The V5 pipeline uses double register buffer that in combination with
large tile size causes high register pressure. The latest version of
compiler handles the register spillage incorrectly for `gfx90a`, which
cause the kernel to output incorrect results.
The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3`
that do not use direct load for are divided into two groups
- Base instances
- Instances that result into high register usage (currently only one
instance - one that causes the test failures).
This division allows to disable only the V5 block-GEMM flavor of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. The
selective disabling leaves the set of instances for 1D and 2D
convolutions unaffected, and removes at runtime two V5 block-GEMM
instances (`ConvBwdWeightDefault` and
`ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the
device is `gfx90a`.
Because MIOpen uses CK's type string (provided by method
`GetTypeString`) to identify the instances, the DB sync tests are
expected to unaffected since there are still the V2 block-GEMM instances
that result in the same type string
(`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by
running the MIOpen DB sync tests that are not part of the normal CK PR
build.
## Test Plan
Running all CI tests + the MIOpen internal DB sync tests is sufficient
to verify the correctness of the code changes.
## Test Result
Verified locally that the previously failing tests
`TestGroupedConvndBwdWeight3d/4.Test3D` and
`TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts
- 231 on `gfx90a`
- 233 on `gfx942`
and are currently passing. This confirms the expectation that two
instances per data type should be disabled on `gfx90a`.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Ville Pietilä <>
[ck] Clamp negative kernel execution elapsed time to zero (#6379)
## Motivation
hipEventElapsedTime can return a small negative value on Windows when
timing a very fast kernel launch on the null stream. This caused
consumers of launch_and_time_kernel to receive a negative elapsed time,
which they reasonably treat as an error, breaking otherwise-correct
kernel executions.
## Technical Details
After calling hipEventElapsedTime, a clamp is applied in
launch_and_time_kernel before the result is returned, avoiding the
return of a physically impossible elapsed time.
The negative value from hipEventElapsedTime has been observed on
Windows. For kernels that complete in well under a millisecond, the HIP
event timestamps can alias such that the computed difference is a small
negative number (observed: ~-1.78 ms). No HIP error is reported by any
surrounding call (hipEventRecord, hipEventSynchronize, hipGetLastError),
confirming the kernel itself executed successfully.
## Test Plan
- Recompile CK and validate no kernel execution reports a negative
elapsed time during hipTensor tests.
- Pass the CI/CD pre-checking tests for CK.
## Test Result
- All tests passing
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Conv Bwd Data flush cache and profiling improvements (#6090)
## Motivation
Improve accuracy of conv bwd data perf measurements
## Technical Details
- enable flush cache
- for grouped conv we zero conv input(gemm output) inside device op, so
we also include this in time measurement
- for non-grouped conv we zero conv input(gemm output) outside device op
(in profile_conv_bwd_data_impl.hpp) so it is not included.
- In this pr I changed it to include zeroing if time_kernel/flush cache
is enabled so at now you should have more fair comparison. I changed it
only for time_kernel/flush_cache because MIOpen run own zeroing for
non-grouped solvers.
## Test Plan
test_grouped_conv_bwd_data_*
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Add missing gfx1033 to gfx103 group definition in ck (#5141)
## Motivation
Resolving PyTorch build failures when enabling builds for gfx103X-all
family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033`
is the only failing architecture in the family and the failures point to
missing support in CK.
## Technical Details
PyTorch build fails with repeated error message
```
/__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
33 | wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
`gfx1033` is missing from the `__gfx103__` group which results in
`CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding in
`gfx1033` to the missing files which should be the minimum fix to allow
torch builds to pass.
## Test Plan
Compile sample test file and target gfx1033
```
...
#ifdef __HIP_DEVICE_COMPILE__
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value");
#else
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value");
#endif
```
## Test Result
Prior to the applying patch, compilation fails with `error: use of
undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`
After applying patch, test file compiles successfully.
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022)
## Motivation
Consumers that identify kernels by their `GetTypeString()` (such as
hipTensor's actor-critic kernel selection, which hashes the string into
a
stable cross-platform UID) were silently dropping one of two colliding
variants during registry insertion.
`GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle`
previously
printed 13 template parameters, omitting
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.
These four parameters determine the block-transfer access width and LDS
padding strategy, and are precisely what differentiates the `kk`, `kn`,
`mk`, and `mn` layout variants from one another when all other geometry
parameters are equal. Two instantiations with identical 13-parameter
strings
are distinct C++ types that accept different stride layouts and reject
each
other's arguments via `IsSupportedArgument`.
This patch extends the output to 17 parameters so that every distinct
template instantiation of this class produces a unique
`GetTypeString()`.
## Technical Details
`include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`:
- extend `GetTypeString()` from 13 to 17 parameters including
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.
## Test Plan
Build CK and hipTensor with these changes, and verify hipTensor can
differentiate and select the
correct kernels with layout variations.
## Test Result
CK is building correctly and hipTensor is selecting the kernels
correctly.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] fix clang lifetimebound errors with staging compiler (#5921)
## Motivation
The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.
## Technical Details
Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:
**`ck::` namespace:**
- `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
- `index_array` -- `operator[]`
- `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
- `IndexLookupTable` -- `operator[]`
**`ck_tile::` namespace:**
- `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `static_array` -- `operator[]`
- `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `make_kernel()` -- parameter pack
Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
accordingly.
## Test Plan
- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] More lifetime-warning suppression (#5639)
## Motivation
The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.
## Technical Details
This patch adds the pragma push diagnostics to ignore the
lifetime-warnings in the modified files to unblock compiler promotion.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix min k_batch calculation in conv kernels (#5785)
## Motivation
Avoid division by 0 and remove not needed "-1".
## Technical Details
Our div up implementation return lower value if input is divisible.
There is no need to subtract 1.
## Test Plan
test_grouped_conv_bwd_weight
## Test Result
Passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1019
[CK] Fix MOE FP8 SplitK buffer descriptor OOB (#5086)
When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base
pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces
are still based on full K dimension. This causes hardware buffer
resource descriptors to extend beyond the actual tensor allocation,
leading to GPU memory access faults when the tensor happens to be placed
at the end of an allocated memory pool region.
Fix by subtracting the split offset from each buffer's element space in
both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer
descriptor range [shifted_base, shifted_base + reduced_space) exactly
covers the valid allocation.
Also refactor SplitKBatchOffset to accept const Problem& (instead of
Argument&) and add a default constructor, enabling direct reuse in
Run/Run_2Lds without duplicating offset calculation logic.
Made-with: Cursor
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Yi DING <yi.ding@amd.com>
[CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454)
## Motivation
Flush cache to get more stable results during profiling old ck and ck
tile.
## Technical Details
Flush cache before each kernel call and one more first run.
## Test Plan
test_grouped_conv_bwd_weight_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-966
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
[CK] fix moe memset size which is bigger than alloc (#5225)
## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114)
## Motivation
Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details
- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-786
[CK] Unify the grouped convolution gridwise Run() functions (#4421)
## Motivation
There are currently three different grouped convolution related Run()
function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`.
These are used for the different types of grouped convolution: Forward,
Backward weights, and Backward data.
The functions are very similar and should be unified to a single `Run()`
function for all types of grouped convolution.
## Technical Details
The three old `Run<>()` functions were replaced with a single unified
function.
The new `Run<>()` function is run from device implementations:
- DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3
- DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3
- DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
- DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3
- DeviceGroupedConvBwdWeight_Wmma_CShuffleV3
The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor
implementation uses a different `Run<>()` overload and was therefore not
modified.
## Test Plan
Run the following grouped convolution tests on `gfx1201`, as this
architecture is WMMA-capable:
- `test_grouped_convnd_fwd`
- `test_grouped_convnd_bwd_weight`
- `test_grouped_convnd_bwd_data`
Compilation and testing were also executed on `gfx1100` to avoid CI
problems.
## Test Result
First part (unification of `Run<>()` function): All tests successful.
Second part (integration of single `Run<>()` function as a direct call):
All tests successful.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
[CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)
Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.
## Motivation
As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.
## Technical Details
This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few todos since we need to
generalize our convolution traits to generalize across WMMA/MFMA and
CK/CKTile.
## Test Plan
Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and I added longer tests that instantiate
kernels, following the existing pattern in other kernals.
## Test Result
I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
[CK] Precompute SpaceFillingCurve indices to reduce compile time by 31% (#5041)
## Summary
Optimize `SpaceFillingCurve` in CK to reduce compile time by
precomputing all index values into a static constexpr lookup table.
### Problem
- `GetIndex<N>` was instantiated separately for every index value (0 to
NumAccesses-1)
- Each instantiation triggered nested `static_for` loops with O(N²)
template depth
- This caused **34,000+ template instantiations** taking **69 seconds**
in frontend
### Solution
- Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed
indices
- Add `compute_single_index()` helper using O(N) `static_for` loops
- Add `compute_all_indices()` to build entire table in one constexpr
evaluation
- `GetIndex<N>` becomes simple array lookup: `return index_table[N]`
### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp)
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Total compile time | 120.4s | 83.6s | **-31%** |
| Frontend time | 88.7s | 52.6s | **-41%** |
| GetIndex instantiations | 34,176 | 384 | **-99%** |
| GetIndex time | 69.0s | 0.11s | **-99.8%** |
| SpaceFillingCurve time | 75.7s | 4.3s | **-94%** |
## Test plan
- [x] Builds successfully with `-Werror -Weverything`
- [ ] Run existing unit tests
- [ ] Verify numerical correctness on sample kernels
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Proof of concept for removing forward declarations (#5135)
## Motivation
Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:
9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)
This is mainly required to break a circular dependency in reflection.
The architecture of that is as follows:
MyDeviceOp implements GetInstanceString(). This is typically defined
directly in the class definition (no forward declaration).
GetInstanceString() calls instance_string<MyDeviceOp>()
instance_string<MyDeviceOp>() calls
InstanceTraits<MyDeviceOp>::instance_string()
InstanceTraits has a specialization for MyDeviceOp which implements
instance_string()
So order for GetInstanceString() to work properly, InstanceTraits must
already be defined. And for InstanceTraits to be defined, the device op
needs to be defined. In order to do that, we are currently using
aforementioned forward declaration.
## Technical Details
C++'s lazy template evaluation is used by calling into an as-of-yet
undefined function static member function of
`InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then
specializing `InstanceTraits` only _after that_. The caveat here is that
both the device op itself as well as the instance traits specialization
must be in scope, otherwise there would be an undefined function error.
In practise, we can solve that either by placing the instance traits
directly into the file that defines `MyDeviceOp`, or possibly by using a
`.inc` file to keep the concerns separated.
## Test Plan
The results were verified by running the existing regression tests for
CK Builder
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>
Compile-time optimize threadwise slice transfer (#4673)
## Motivation
Profiling with `-ftime-trace` on representative translation units (e.g.,
`device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`)
revealed
that **92% of frontend time was spent in template instantiation**. The
primary
bottleneck was redundant instantiation of identical helper logic across
multiple
threadwise transfer class variants.
Each `ThreadwiseTensorSliceTransfer_v*` class independently contained
its own
copy of the same helper computations — serpentine traversal, coordinate
stepping, thread scratch descriptors, lambda-like functors, and
compile-time
constants — duplicated across 13 header files. When a typical GEMM or
convolution kernel TU includes blockwise operations (e.g.,
`blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants
simultaneously, causing the compiler to instantiate the same helper
logic
multiple times with the same template arguments.
This was compounded by the helpers being defined as members of the outer
`ThreadwiseTensorSliceTransfer_v*` classes, which carry 14+ template
parameters.
Functions like `ComputeForwardSweep` depend only on their two argument
types,
but as inline members of the outer class, the compiler was forced to
create
separate instantiations for every unique combination of all outer
parameters
(data types, descriptors, vector widths, etc.) — even when most of those
parameters had no effect on the helper's output.
## Technical Details
### The Fix: Shared Helper Struct Hierarchy
Duplicated logic was extracted into a standalone helper hierarchy in
`threadwise_tensor_slice_transfer_util.hpp`:
```
ThreadwiseTransferHelper_Base (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor,
| ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple)
+-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex,
| ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable)
+-- ThreadwiseTransferHelper_SFC (ComputeSFCCoordinateResetStep)
```
Each helper method is now parameterized **only by what it actually
uses**:
- `ComputeForwardSweep(idx, lengths)` — parameterized only by the two
argument
types, not by `SrcData`, `DstData`, `SrcDesc`, etc.
- `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by
the
descriptor and access sequence types.
- `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector,
DimAccessOrder>()` — parameterized only by the four values it actually
needs.
This reduces template instantiation work in two ways:
1. **Across different transfer variants** (v3r1 vs v3r2 vs v3r1_gather):
the
compiler reuses a single instantiation instead of creating one per
variant.
2. **Across different outer class instantiations** (fp16 vs bf16 vs
int8): the
compiler reuses the helper instantiation because the helper doesn't
depend
on the data type at all.
### Refactored Headers
**13 headers** now delegate to the shared helpers instead of duplicating
logic:
- Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant
- SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter
- Dead code removed: v4r1, v5r1
### Additional Fixes Found During Refactoring
- Two latent bugs in v3r2 (`forward_sweep` indexing,
`GetDstCoordinateResetStep` extraction)
- Dead `SrcCoordStep` variables in v4r1 and v5r1
- Unused `scale_element_op_` member in v3r1_dequant (restored with note)
### Net Code Change
+1,428 / -2,297 lines (~870 lines removed).
## Test Plan
### Unit Tests
28 host-side gtests in
`test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp`
covering the full helper hierarchy:
| Suite | Tests | What is verified |
|-------|-------|------------------|
| ThreadwiseTransferHelperBase | 6 | Compile-time constants,
inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false
in 2D and 3D |
| ThreadwiseTransferHelperSerpentine | 9 | `ComputeForwardSweep`
(even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete),
`ComputeDataIndex`, `ComputeCoordinateResetStep`,
`VectorSizeLookupTable`, `VectorOffsetsLookupTable` |
| ThreadwiseTransferHelperSFC | 6 | `ComputeSFCCoordinateResetStep` —
single access, 2D row-major, 2D column-major, 3D batch, even/odd inner
access counts |
| ThreadwiseTransferHelperInheritance | 3 | Serpentine and SFC derive
from Base, are not related to each other |
| DetailFunctors | 4 | `lambda_scalar_per_access`,
`lambda_scalar_step_in_vector`,
`lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) |
### Semantic Equivalence
GPU ISA comparison using `--cuda-device-only -S` confirmed identical
assembly
output (modulo `__hip_cuid_*` metadata) between baseline and refactored
code.
## Test Results
All measurements on a 384-core machine, `-j64`, freshly rebooted,
near-idle.
### Targeted Builds (affected targets only)
| Target | Baseline | Refactored | Wall-clock Delta | CPU Delta |
|--------|----------|------------|-----------------|-----------|
| `device_grouped_conv2d_fwd_instance` (160 TUs) | 7m 37s / 189m CPU |
6m 53s / 161m CPU | **-9.7%** | **-14.9%** |
| `device_grouped_conv3d_fwd_instance` (185 TUs) | 9m 49s / 202m CPU |
6m 42s / 182m CPU | **-31.8%** | **-10.0%** |
| **Combined** | **17m 27s / 392m CPU** | **13m 35s / 344m CPU** |
**-22.2%** | **-12.4%** |
### Full Project Build (8,243 targets)
| Metric | Baseline | Refactored | Delta |
|--------|----------|------------|-------|
| Wall-clock | 103m 38s | 111m 56s | +8.0%* |
| CPU time | 4705m 7s | 4648m 17s | **-1.2%** |
\*Wall-clock inflated by external load spike during refactored build
(load 90 vs 66). CPU time is the reliable metric.
### Context
~15% of all build targets (1,262 / 8,243) transitively include the
modified
headers. These are primarily GEMM and convolution kernel instantiations
— the
core compute workloads. The 12-15% CPU savings on affected targets is
diluted
to 1.2% across the full project because 85% of targets are unaffected.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
[CK] use int64 for ptr offset (#5094)
## Motivation
When the number of experts (E) is large (e.g., E=257 in DeepSeek-V3),
the `expert_id * expert_stride` calculation in MOE GEMM kernels
overflows `int32` (`index_t`), causing the weight matrix (B) pointer to
wrap to an invalid address and triggering a GPU memory access fault.
For example, with `N=1024, K=7168, IsInputGemm=true`:
- `expert_stride = N * K * 2 = 14,680,064`
- `INT32_MAX / expert_stride ≈ 146`
- Any `expert_id >= 147` causes overflow → negative offset → illegal
memory access → GPU crash
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: amd-shiraz <shiraz.ali@amd.com>
Add generate_identity_sequences helper and replace lambdas with named functors (#4828)
## Summary
- Add `generate_identity_sequences<N>()` helper that returns
`Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>`
- Replace lambdas with named functors in `transform_tensor_descriptor`
- Add `unpack_and_merge_sequences` helper functor
- Reduces `transform_tensor_descriptor` instantiations from 388 to 32
(92% reduction)
## Motivation
Multiple call sites use `generate_tuple([](auto i) { return
Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda
instantiations.
Additionally, each lambda in `transform_tensor_descriptor` creates a
unique closure type, causing the function to be instantiated separately
for every call site. Named functors share a single type, so the compiler
reuses the same instantiation.
## Changes
### Part 1: generate_identity_sequences helper
- Replaces common lambda pattern for generating identity sequences
- Each lambda expression creates a unique closure type, causing separate
template instantiations at every call site
- Named helper shares a single type across all uses
### Part 2: Named functors in transform_tensor_descriptor
- Add `unpack_and_merge_sequences` helper to replace lambda in
`GetNumOfHiddenDimension`
- Use `generate_identity_sequences` in `matrix_padder.hpp`
## Test Plan
- [x] Added 7 unit tests:
- 4 tests for `generate_identity_sequences`
- 3 tests for `unpack_and_merge_sequences`
- [ ] Waiting for full CI
## Related PRs
This PR merges the functionality from:
- ROCm/composable_kernel#3588 (generate_identity_sequences helper)
- ROCm/composable_kernel#3589 (Named functors in
transform_tensor_descriptor)
Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times)
**Note:** This PR supersedes #4283, ROCm/composable_kernel#3588 and
ROCm/composable_kernel#3589, which can be closed once this is merged.
---
🔁 Imported from
[ROCm/composable_kernel#3628](https://github.com/ROCm/composable_kernel/pull/3628)
🧑💻 Originally authored by @tenpercent
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
[CK] Port non-grouped convolution instances to the grouped kernels (#4875)
## Motivation
Port non-grouped convolution instances to the grouped kernels to
deprecated older non-grouped implementations.
## Technical Details
Add the same instances as non-grouped but using grouped kernel.
## Test Plan
test_grouped_convnd_fwd
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-724
[CK_Builder] added bwd data kernels to builder factory (#4582)
This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
Co-authored-by: John Shumway <jshumway@amd.com>
Implement device_grouped_gemm_fixed_nk_bias for RDNA4 (#4340)
## Proposed changes
Summary:
- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests
**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425)
## Motivation
Add support for grouped gemm multi ABD fixed NK. MR
## Technical Details
Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned apparently for xdl this feature was missing tests
so our assumption is either there is an implemenetation bug or these
instances were not set up correctly. Has the potential for a follow-up
issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.
Fix epilogue selection in device implementation that caused unit test
failures
## Test Plan
Covered by added unit tests
## Test Result
CI successfully passing
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Revert "[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3" (#4762)
Reverts ROCm/rocm-libraries#4638
unfortunately, this PR interfered with the PR#4299 and caused build
errors for gfx11:
In file included from
/rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7:
In file included from
/rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11:
/rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21:
error: no matching function for call to 'CheckValidity'
553 | if(!GridwiseGemm::CheckValidity(
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~