[CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used
in addition to MIOPEN_REQ_LIBS_ONLY (#5675)
## Motivation
When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and
-DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly
installed. This change is necessary to allow hipTensor to build with
TheRock without the need to rebuild CK from source.
## Technical Details
The solutions consists in considering both HIPTENSOR_REQ_LIBS_ONLY and
MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake,
following the same approach used to the conv target (for MIOpen).
## Test Plan
Manually test the build and installation with
`-DHIPTENSOR_REQ_LIBS_ONLY=ON` and both `-DHIPTENSOR_REQ_LIBS_ONLY=ON
-DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files as
installed.
## Test Result
The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the
targets contraction, reduction and other, while
`-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv,
contraction, reduction and other.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] fix clang lifetime bound error in ck_builder.
## Motivation
This resolves the compilation error with latest develop compiler branch.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Re-enable daily builds with staging compiler
## Motivation
This should help us catch and fix any new compilation issues early on.
## Technical Details
We now have three compiler profiles:
* **develop**: slightly stabilized version of amd-staging with some of
the obvious offending PRs reverted, 1-2 weeks behind amd-staging;
* **amd-mainline**: more stable version of compiler, the baseline for
all other branches, e.g., release, npi, etc. 2-4 weeks behind
amd-staging.
* **amd-staging**: latest compiler version where all new PRs land, often
broken;
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: kensclin <lshyhchy@amd.com>
[CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler
(#5516)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things
- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.
## Technical Details
Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.
Revert "Ck/joye/revert oob check (#5640)"
This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419.
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Adding New Notification Detection
## Motivation
Restricting one of the notification failure patterns to match a specific
missing drivers log pattern. This will help reduce the noise of
erroneous logs. Also adding a new failure pattern to notify us of Github
access issues.
## Technical Details
- Set the failure pattern to match the exact failure observed in the
logs.
- Switching to a plain substring search so special characters are
handled literally.
- Added a new failure pattern for Github access errors.
## Test Plan
- Force a failure using the known failure patterns.
## Test Result
The forced failures were triggered and caught by the notification
system.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Fix kbatch check in grouped conv and gemm
kernels (#5555)
## Motivation
Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.
## Technical Details
Round up K / Kperxdl and divide it by Kbatch to allow tail for K.
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Prune Stream-K Tile Engine Tests
## Motivation
Stream-K tile engine tests are causing issues for build time. While we
work on a more permanent solution, these changes prune the Stream-K test
instances to help reduce the build time burden.
## Technical Details
The Stream-K team recently transitioned to using CK Tile's tile engine
infrastructure for our smoke tests. However, since tile engine creates
an individual target per kernel instance, we've found that the tile
engine tests are increasing build times. Our team is currently working
to convert our existing tile engine tests back to basic gtests. While
this work takes place, we are temporarily pruning the existing Stream-K
tile engine test instances to help reduce the build time burden.
## Test Plan
Ran the pruned test set on all gfx90a, gfx942, and gfx950.
## Test Result
All tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Improved CI infrastructure failure detection
## Motivation
This PR re-enables CI infrastructure failure detection and notification,
which had been disabled due to performance issues caused by loading
large build logs (~80k lines) into memory for pattern scanning. The goal
is to reliably detect known infrastructure failures (GPU errors, Docker
authentication issues, disk space errors, etc.) and send actionable
Teams notifications without hanging on large logs.
## Technical Details
- Replaced full build log loading and Groovy-based pattern scanning with
a streaming wget | grep -E pipe. grep scans natively so the full log is
never loaded into Groovy, resolving the hang on large logs.
- Combined all failure patterns into a single grep -E call to avoid
multiple log fetches.
- The node name is now tracked with the observed failure.
- Added a new failure pattern for device's running out of space.
## Test Plan
- Forced failures in the "Determine CI Execution" stage with all 9
failure patterns echoed to the build log.
- Simulated large log sizes (~80k lines of dummy output) to validate
pattern detection and node name extraction at realistic log scales,
including patterns placed both before and after large blocks of dummy
output.
## Test Result
All 9 failure patterns detected correctly. Teams notifications sent with
accurate log context, node name, and job links. No hangs observed on 80k
line simulated logs.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
add self healing to ref repo
## Motivation
Check for when mirror repo gets corrupted in CI
## Technical Details
We detect broken ref objects and rebuild the local mirror in that case
of corruption
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Improve access for merged groups and remove
modulo from xor (#5334)
## Motivation
[CK][CK Tile] Improve access for merged groups and remove modulo from
xor
## Technical Details
- add template parameter to xor if modulo is needed. We don't need
modulo for merged groups
- use access by m for merged groups for a tensor
-
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Move grouped conv cpp instances to build dir
(#5609)
## Motivation
Move grouped conv .cpp instances to build dir. Fix generate instances
script.
## Technical Details
Avoid CI problem when instances in experimental directory are not
removed
## Test Plan
test_grouped_convnd_*_tile
## Test Result
Pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Ck/joye/revert oob check
## Motivation
fix ck_tile's oob check.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Rename Stream-K grid function
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().
## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.
## Test Plan
Validated using the test suite that is already present.
## Test Result
All tests passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Enable MXFP6 for MX GEMM op
## Motivation
Add support for MXFP6 in the MX GEMM op in CK-Tile.
Depends on https://github.com/ROCm/rocm-libraries/pull/4594
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK TILE] Fix smart-build to run install target for
client examples (#5614)
How ninja install works:
- Builds library dependencies (device_operations, etc.)
- Installs them to CMAKE_INSTALL_PREFIX
- Skips building test executables (not install dependencies)
Affected stages (8):
- gfx942/gfx950/gfx908/gfx90a CK Client Examples
- gfx10-1/gfx10-3/gfx11/gfx12 CK Client Examples
## Motivation
Problem: When smart-build is enabled (runAllUnitTests=false), the build
step is skipped entirely. This causes client example stages to fail
because they depend on the CK library being installed to ../install.
Error seen:
Target "client_gemm" links to:
composable_kernel::device_other_operations
but the target was not found.
## Technical Details
Root cause: Line 712 only checked runAllUnitTests, so when building with
config_targets="install", the install target was never built, leaving
the install directory empty.
Fix: Added condition to always build when config_targets contains
'install'. The install target automatically builds its dependencies (the
CK libraries) but skips building tests, which aligns with smart-build
philosophy.
## Test Plan
Should be tested on CI
## Test Result
Should be tested on CI
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] add tf32 support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.
## Checklist
Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run on all changed files
- [ ] Any dependent changes have been merged
## Discussion
[CK] [CK_TILE] Improve build and test time of CI with smart
dependency parser (#5249)
## Motivation
Existing dependency parser needs full build of tests to determine which
tests are affected by code changes in a PR. This still takes 2-4 hours
for building the tests which slows down the CI as the number of tests
grow. To resolve this issue we implemented a smart dependency parser
which uses CMake Configure to parse dependencies and build only the
affected test cases. We have ensured that two approaches are available
1) CMake pre-build analysis for each PR to ensure fast build and test.
2) Ninja post-build analysis to enable full build for nightly tests.
## Technical Details
```bash
### 1. Configure the project with CMake
cmake -G Ninja -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..
### 2. Analyze dependencies (no build required!)
python3 ../script/dependency-parser/main.py cmake-parse compile_commands.json build.ninja \
--workspace-root .. --output cmake_dependency_mapping.json --parallel 8
### 3. Find tests affected by changes
python3 ../script/dependency-parser/main.py select cmake_dependency_mapping.json origin/develop \
HEAD --test-prefix --output tests_to_run.json
### 4. Build only affected tests
ninja $(jq -r '.executables[]' tests_to_run.json | tr '\n' ' ')
### 5. Run affected tests
ctest -R "$(jq -r '.regex' tests_to_run.json)"
```
### Jenkins Integration
- Added `buildMode` to jenkinsfile to integrate both `selective` and
`full` build methods
### Known Limitations
### 1. Build-Time Generated Headers (HIGH RISK)
**Problem:** Files generated during the build process (e.g., via
`add_custom_command`) cannot be analyzed before building.
**Example:**
```cmake
add_custom_command(
OUTPUT ${CMAKE_BINARY_DIR}/generated/config.hpp
COMMAND generate_config.sh
DEPENDS template.hpp.in
)
```
**Impact:** If a source file includes `generated/config.hpp`, the
dependency won't be detected until after building.
**Mitigation:**
- CK analysis shows **no generated headers** currently used
- If generated headers are added in the future, they must be built first
- Recommendation: Generate headers in CMake configure phase (not build
phase) when possible
## Test Plan
**1. Modified Files:**
```
include/ck_tile/ops/common.hpp
include/ck_tile/ops/gemm.hpp
include/ck_tile/ops/gemm/warp/warp_gemm.hpp
```
**2. Compare tests selected between `build.ninja` and `cmake-parse`
methods**
## Test Result
- 1. The test completed in 5-6 minutes finding about 8000+ executables
that should be built.
- 2. We selected a commit 5ccc1387ea which resulted in same 7 tests with
both legacy and new methods.
-
PR | Legacy tests | Smart tests | Notes
-- | -- | -- | --
5261 | 453 | 455 | Only 2 tests (test_amdgcn_mma and
test_amdgcn_sparse_mma)
5168 | 0 | 0 | Changes in dispatcher only. No CK tests invoked.
5249 | 0 | 0 | Changes to dependency parser. No CK tests invoked
5260 | 0 | 0 | Changes in dispatcher only. No CK tests invoked.
5174 | 1 | 1 | One test from FMHA affected by this PR in both cases
5383 | 0 | 0 | Changes are only in benchmark files. Did not trigger any
tests
5445 | 1 | 1 | Changes are only to tests/ck_tile/gemm_streamk. Only
triggered one streamk test in both cases.
5454 | 3 | 3 | Both methods identified same test_grouped_conv_bwd tests
5427 | 234 | 234 | Core infrastructure header changes. Detected exactly
same tests
5388 | 85 | 85 | modifies warp-level GEMM operations (warp_gemm.hpp,
warp_gemm_dispatcher.hpp). Correctly identified all the streamK gemm
tests
## Submission Checklist
- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix MOE FP8 SplitK buffer descriptor OOB
When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base
pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces
are still based on full K dimension. This causes hardware buffer
resource descriptors to extend beyond the actual tensor allocation,
leading to GPU memory access faults when the tensor happens to be placed
at the end of an allocated memory pool region.
Fix by subtracting the split offset from each buffer's element space in
both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer
descriptor range [shifted_base, shifted_base + reduced_space) exactly
covers the valid allocation.
Also refactor SplitKBatchOffset to accept const Problem& (instead of
Argument&) and add a default constructor, enabling direct reuse in
Run/Run_2Lds without duplicating offset calculation logic.
Made-with: Cursor
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
CK Tile MX GEMM Packing Improvement
## Motivation
Reduce the scale loading size and also has better utilization of MFMA
scale selection.
## Technical Details
Add up the packing of mx scales.
## Test Plan
Use the existing test cases.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Add LLC-aware FMHA head grouping and head-major
scheduling on RDNA (#5018)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.
This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.
In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.
## Technical Details
This PR adds two complementary strategies:
- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
- Estimate LLC size (env override, KFD sysfs, or arch default).
- Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).
- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
- Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
improving temporal locality of per-head K/V reuse.
Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.
## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`
## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd
SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x
- TFLOPs by sequence length target: gfx1201 layout: bshd
SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Grouped Convolution backward weight profiler
flush cache (#5454)
## Motivation
Flush cache to get more stable results during profiling old ck and ck
tile.
## Technical Details
Flush cache before each kernel call and one more first run.
## Test Plan
test_grouped_conv_bwd_weight_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-966
[CK] fix moe memset size which is bigger than alloc
## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Eight Waves pipeline GEMM
## Motivation
Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM
## Technical Details
Summary:
- Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
- Use this block struct in ABQuant for encodings
- Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
- Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves
So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts
## Test Plan
Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).
Note: K padding test is disabled for this pipeline because it's not
implemented yet
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Grouped Convolution Backward Weight set of
fixes (#5387)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Grouped Convolution Backward Weight split k fixes for CK tile kernels
## Technical Details
- get k batch from kargs to get deduced k batch
- multiply zeroing size by data type size
- disable v6 (producing a incorrect results)
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
Pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK
Tile profiler (#5237)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
the missing bwd weight two-stage kernels to the CK Profiler.
## Technical Details
Extended the CK Tile conv builder factory to build also the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instance to accept the two-stage flag as part of the algorithm
configuration.
## Test Plan
Added units tests for CK Builder that verify the two-stage kernel
construction.
## Test Result
If CI passes, the added unit tests are passing.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Skip work if any of Grouped GEMM groups M/N/K are
zero. (#5050)
## Motivation
It's common in MoE workloads that some experts receive zero tokens,
which would result in some of the dimensions equal to zero. Currently we
handle such case only for non-persistent kernels where we have all GEMMs
information beforehand on host - we validate this during creation of
kernel arguments. However for the "dynamic" input path (persistent
kernel) this information is not available before kernel launch. Thus we
have to validate this during kernel execution. The goal is to add this
validation.
## Technical Details
Skip work if any of Grouped GEMM groups M/N/K are zero for persistent
kernel path.
## Test Plan
Add unit-tests which cover "dynamic" inputs with zero dims for
persistent kernel execution path.
## Test Result
All tests pass.
## Submission Checklist
- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][Examples] Adding parameters for a couple of CK examples:
-gemm_add_add_mean_meansquare_xdl_fp16 -gemm_dl_quantization_int8
-gemm_xdl_bias_relu_quantization_int8 -gemm_xdl_quantization_int8
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Changed the include order of the new WMMA/MFMA unification
framework (#5241)
Those changes are to fix the include order and make header files
independent of one another. Also the `remod.py` sript has run and
changed the `grouped_convolution.hpp` and `core.hpp` files.
## Motivation
Some headers appear to depend on include order.
For example, when moving `#include "wmma/wmma.hpp"` in
[amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp)
later in the include list, it is causing compilation errors. Also the
pre-commit script `remod.py` is shuffling includes to be in alphabetical
order and is causing compilation issues.
Expected behaviour:
Headers should be independent of one another: no header should require
another to be included first. Each header should compile correctly on
its own.
## Test Plan
The CI (that runs `remod.py`) should compile.
## Test Result
Existing CI should compile and be green.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
ck_tile: add gtest unit tests for MX flatmm (gfx950)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
- Add correctness unit tests for the MX-format flatmm kernel
(`example/ck_tile/18_flatmm/mxgemm`) under `test/ck_tile/flatmm/`
- Tests cover all five dtype combinations: FP4×FP4, FP8×FP8, FP6×FP6,
FP8×FP4, FP4×FP8
- Tests cover all four kernel dispatch paths (the `has_hot_loop` ×
`tail_num` product):
- `has_hot_loop=false, tail=ODD` (K=256, num_loop=1)
- `has_hot_loop=false, tail=EVEN` (K=512, num_loop=2)
- `has_hot_loop=true, tail=ODD` (K=768, num_loop=3)
- `has_hot_loop=true, tail=EVEN` (K=1024, num_loop=4)
- Remove unsupported `-split_k` CLI option from
`tile_example_mx_flatmm`; the pre-shuffled B layout is incompatible with
K-splitting and the option silently produced wrong results
## Changes
**New files (`test/ck_tile/flatmm/`):**
- `CMakeLists.txt` — builds 40 kernel instances as a shared OBJECT
library, links into 5 per-dtype test executables; forwards
`-DCK_TILE_USE_OCP_FP8` when `CK_USE_OCP_FP8` is ON
- `test_mx_flatmm_base.hpp` — base test fixture with
`run_test_with_validation(M, N, K, kbatch=1)`
- `test_mx_flatmm_fixtures.hpp` — concrete `TestMXFlatmm` typed test
class and type aliases
- `test_mx_flatmm_fp{4fp4,8fp8,6fp6,8fp4,4fp8}.cpp` — per-dtype
`TYPED_TEST_SUITE` files
**Modified files:**
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp` — moved
`preShuffleWeight` here (was in `mx_flatmm.cpp`) so it is includeable by
both the example and the tests
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm.cpp` / `run_mx_flatmm.inc`
— removed `-split_k` CLI arg, hardcoded `k_batch=1`, fixed `k_split`
formula, updated call sites after `preShuffleWeight` move
- `test/ck_tile/CMakeLists.txt` — added `add_subdirectory(flatmm)`
[CK][CK Tile] Improvements for grouped conv fwd tile
profiling (#5114)
## Motivation
Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details
- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-786
[CK_TILE] Add the GEMM Memory pipeline to Stream-K tests
(#5242)
## Motivation
We want to extend our Stream-K coverage to include other GEMM pipeline
since our current tests only test the CompV3 pipeline.
## Technical Details
All Stream-K unit tests currently only tests one pipeline: CompV3. These
changes extend the test support to also test the Memory pipeline. Future
work will add support for additional GEMM pipelines.
The major changes are as follows:
- **Remove of fp8 and bf8 extended tests for gfx90a**: gfx90a does not
have native support for fp8 and bf8 and emulate the behavior with fp32
mfma instruction sizes. We've observed extremely long compile times for
fp8 and bf8 on gfx90a (exceeding 15 minutes), hence we've opted to
disable these tests.
- **Add the memory pipeline to the Stream-K tile engine tests**: Now our
smoke tests covers compv3 and memory pipelines.
- **Add the memory pipeline to the Stream-K extended tests**: These
changes modify the test kernel types to include the appropriate
pipeline. Each pipeline is contained within a separate kernel type to
help avoid large increases in build time.
## Test Plan
- Ran existing and added tests on all architectures.
## Test Result
- All local tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Optimize ck_tile::sequence to reduce template
instantiation depth [2A] (#5028)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
### Rationale
`ck_tile::sequence` is the most fundamental metaprogramming type in
ck_tile — it underpins
tensor dimensions, strides, loop bounds, and index calculations. Six of
its metafunctions
use recursive template instantiation, producing O(N) to O(N²)
intermediate types that
the compiler must process. When these are used inside deeply nested GEMM
pipelines with
large dimension counts, the cumulative instantiation overhead becomes a
significant
contributor to frontend compile time.
Measurements on `test_gemm_pipeline_compv6` show 84,288
`InstantiateFunction` calls in
the frontend alone. Reducing template instantiation depth in these core
utilities has a
multiplicative effect because they are called from hundreds of sites.
### What changed
| Metafunction | Before | After |
|---|---|---|
| `sequence::modify` | O(N) recursive split/merge | O(1) pack expansion
|
| `sequence_gen` | O(log N) recursive binary split | O(1) via
`__make_integer_seq` |
| `uniform_sequence_gen` | Delegates to `sequence_gen` | O(1) via
`__make_integer_seq` |
| `sequence_reverse_inclusive_scan` | O(N) recursive | O(1) constexpr
for-loop + pack expansion |
| `sequence_inclusive_scan` | Computed via reverse + flip | O(1)
constexpr for-loop (unified impl) |
| `sequence_exclusive_scan` | O(N) recursive merge chain | O(1)
constexpr for-loop + pack expansion |
| `sequence_map_inverse` | O(N²) recursive modify calls | O(1) constexpr
for-loop + pack expansion |
Supporting changes:
- Portable `__type_pack_element` fallback with `__has_builtin` guard
(hipRTC-safe, no `<tuple>` dependency)
- Renamed reserved `__integer_sequence` to `integer_sequence_wrapper`
- Adopted `static_array` from develop (PR #4355) for constexpr
computation
- Unified forward and reverse inclusive scan into a single
`sequence_inclusive_scan_impl` with `bool Reverse` template parameter
- Added `sequence_inclusive_scan` struct (new public API for forward
scan direction)
- Replaced recursive `sequence_exclusive_scan` (3 template
specializations) with `sequence_exclusive_scan_impl` using the same
constexpr for-loop pattern as inclusive scan
- Rewired `exclusive_scan_sequence` and `prefix_sum_sequence` to use new
impl
- Added `CK_TILE_HOST_DEVICE` to `exclusive_scan_sequence` and
`prefix_sum_sequence` to match sibling scan function annotations
### Technical debt and housekeeping
- Unified all `namespace impl` to `namespace detail` across sequence.hpp
for consistency
- Removed dead comment block (orphaned `integer_sequence` alternative)
- Added defensive `static_assert(sizeof...(Is) > 0)` in
`sequence_map_inverse::build_inverse`
- Converted all multi-line Doxygen blocks from `///` to `/** */` per
style guide
- Corrected `constexpr static` to `static constexpr` keyword ordering in
`static_array`
- Added blank line between `#pragma once` and first `#include` in
`static_array.hpp`
- Trimmed redundant 4-line comment on `sequence_gen_helper` to a
one-liner
- Moved `sequence_gen` Doxygen comment below `namespace detail` block so
it directly precedes the struct it documents
- Added Doxygen `@brief`/`@tparam`/`@pre` documentation for
`sequence_gen` and `sequence_map_inverse` public APIs
- Added `@brief` documentation to `static_array` explaining relationship
to `ck_tile::array`
- Added scope comment at `namespace detail` openings
**Note:** `private:`/`public:` access modifier indentation is enforced
at 4 spaces by
`.clang-format`. The style guide calls for left-alignment, but the
formatter overrides
this. Requires a `.clang-format` config change to resolve — not
addressable in code.
### `static_array` hardening (from develop's PR #4355)
- Added zero-length array guard (`T elems[N > 0 ? N : 1]`)
- Added `CK_TILE_HOST_DEVICE` annotations to `operator[]` and `size()`
- Added `#include "ck_tile/core/config.hpp"` (IWYU for
`CK_TILE_HOST_DEVICE`)
### Value
Combined with the `static_ford` changes, measured impact on
`test_gemm_pipeline_compv6`:
- **Frontend: -28.9%** (InstantiateFunction: 84,288 → 69,439)
- **Backend: -13.1%** (CodeGen Functions: 3,170 → 2,203)
- **Wall-clock: -16.3%** (611.6s → 512.2s)
### Files changed (4)
- `sequence.hpp`: Metafunction optimizations, namespace unification,
documentation, style fixes
- `static_array.hpp`: Zero-length guard, `CK_TILE_HOST_DEVICE`,
documentation, style fixes
- `test_sequence.cpp`: 50 unit tests with runtime `EXPECT_EQ` assertions
(new file)
- `CMakeLists.txt`: Register new test target
## Test plan
- [x] 50 runtime unit tests covering all optimized and pre-existing
sequence APIs
- [x] Edge cases: empty sequences, single-element, larger sizes (N=8),
negative values, non-trivial init values
- [x] Both functor signatures tested (`operator()(index_t)` and
`operator()(number<I>)`)
- [x] Both scan reducers (`plus`, `multiplies`) with forward, reverse,
inclusive, and exclusive directions
- [x] Exclusive scan: sum, product, single, empty, non-zero init
- [x] Prefix sum: N+1 output verification, single, empty
- [x] Permutation round-trip verification for `sequence_map_inverse`
- [x] Full sequence public API coverage: modify, gen, uniform_gen, scans
(inclusive, exclusive, prefix sum), map_inverse, make_index_sequence,
size/sum/product, push/pop, reverse, extract, merge, arithmetic
operators, equality, transform
- [x] Portable `__type_pack_element` fallback tested implicitly (same
`at_index_t` interface)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Unify the grouped convolution gridwise Run() functions
(#4421)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
There are currently three different grouped convolution related Run()
function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`.
These are used for the different types of grouped convolution: Forward,
Backward weights, and Backward data.
The functions are very similar and should be unified to a single `Run()`
function for all types of grouped convolution.
## Technical Details
The three old `Run<>()` functions were replaced with a single unified
function.
The new `Run<>()` function is run from device implementations:
- DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3
- DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3
- DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
- DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3
- DeviceGroupedConvBwdWeight_Wmma_CShuffleV3
The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor
implementation uses a different `Run<>()` overload and was therefore not
modified.
## Test Plan
Run the following grouped convolution tests on `gfx1201`, as this
architecture is WMMA-capable:
- `test_grouped_convnd_fwd`
- `test_grouped_convnd_bwd_weight`
- `test_grouped_convnd_bwd_data`
Compilation and testing were also executed on `gfx1100` to avoid CI
problems.
## Test Result
First part (unification of `Run<>()` function): All tests successful.
Second part (integration of single `Run<>()` function as a direct call):
All tests successful.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Add BF16^3 support to grouped conv bwd weight: bilinear
and scale (#4591)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Until now, XDL grouped conv bwd weight for bilinear and scale only
supported bf16f32bf16.
Therefore, bf16bf16bf16 support should be added.
## Technical Details
Instances were added to the relevant files in
`library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_bwd_weight/`
folder. In addition, `add()` functions were included in new files in
`library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/xdl/`
and
`library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_scale/xdl/`
folders. The new .cpp files were also included in the `CMakeFiles.txt`
files of both folders.
## Test Plan
Execute `grouped_convnd_bwd_weight` tests to check execution on
different architectures.
The tests for bilinear and scale already include the tuple
`std::tuple<ck::half_t, ck::half_t, ck::half_t, ck::Number<3>>`, so in
principle, there is nothing to modify in the tests themselves.
## Test Result
`gfx1201`: Tests passed.
`gfx1100`: Tests passed.
`gfx90a`: Tests passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on
gfx950 (#4368)
## Motivation
Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline
## Technical Details
The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.
Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
* both 32x32x64 and 16x16x128 scale MFMAs are supported
* Q and K scales are applied in hdim, V scales - in seqlen dimension
* column-major V only
* batch and group mode
* bias, Alibi (tested but no instances by default, just like fp8)
* masking etc.
Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008
## Test Plan
```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```
## Test Result
The tests must pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE ENGINE] Add grouped_gemm operator to Tile Engine
(gfx942/gfx950) (#4996)
## Motivation
The grouped_gemm CK Tile kernel exists (e.g.,
`example/17_grouped_gemm/`) but has no Tile Engine wrapper. Grouped GEMM
handles multiple independent GEMM problems with varying M/N/K dimensions
in a single kernel launch. This PR adds the Tile Engine infrastructure
for automated kernel generation, benchmarking, and profiling of grouped
GEMM kernels.
Jira: AICK-809
## Technical Details
- Created Tile Engine wrapper under `tile_engine/ops/gemm/grouped_gemm/`
following the `gemm_universal` template
- Files added: `CMakeLists.txt`, `grouped_gemm_common.hpp`,
`grouped_gemm_benchmark.hpp`, `grouped_gemm_profiler.hpp`,
`grouped_gemm_benchmark.py`, `grouped_gemm_benchmark_single.cpp`,
`grouped_gemm_instance_builder.py`, `configs/`
- Supported datatypes: fp16, fp8, bf16, bf8
- Supported layouts: rcr, rrr, ccr, crr
- Target GPUs: gfx942, gfx950
- CK Tile kernel: `ck_tile::GroupedGemmKernel` from
`include/ck_tile/ops/gemm/kernel/grouped_gemm_kernel.hpp`
- Instance builder extends `GemmKernelBuilder` base class
- Registered in `tile_engine/ops/gemm/CMakeLists.txt`
- Updated Jenkinsfile to build and benchmark grouped_gemm targets in CI
- Benchmark infrastructure includes JSON output, CSV export, and
verification support
## Test Plan
- CMake configure succeeds for grouped_gemm targets
- Kernel instance builder generates valid kernel headers for all
(datatype, layout) combinations
- At least one kernel binary compiles and runs per datatype/layout
combination
- Correctness passes with `--verify 1` on gfx942/gfx950
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_BUILDER] Add
DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)
Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.
## Motivation
As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.
## Technical Details
This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few todos since we need to
generalize our convolution traits to generalize across WMMA/MFMA and
CK/CKTile.
## Test Plan
Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and I added longer tests that instantiate
kernels, following the existing pattern in other kernals.
## Test Result
I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
[CK] Fix warp tile combination selection in absence of a GPU
(#5213)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
The `get_gpu_name_by_id()` function in
`gemm_streamk_validation_utils.py` relies on `rocminfo` to detect the
GPU architecture at runtime. However, __`rocminfo` fails in CI/build
environments__ where:
- No physical GPU is present
- ROCm tools are not installed
- The build is running in a container without GPU access
In any of these environments, the problem manifests itself in incorrect
kernel validation and will generate template instantiations that do not
exist:
```
[composable_kernel] FAILED: test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o
[composable_kernel] /__w/TheRock/TheRock/build/core/clr/dist/lib/llvm/bin/clang++ -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_ENABLE_TF32 -DCK_TILE_USE_WMMA=0 -DCK_TIME_KERNEL=1 -DCK_USE_FNUZ_FP8 -DCK_USE_GFX94 -DCK_USE_XDL -DDPP_KERNELS -DGEMM_SINGLE_INSTANCE_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp\" -DGEMM_TEST_PARAMS_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/test_params.hpp\" -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/profiler/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/library/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include -I/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/include -I/__w/TheRock/TheRock/build/profiler/rocprofiler-sdk/stage/include -I/__w/TheRock/TheRock/build/profiler/roctracer/stage/include -I/__w/TheRock/TheRock/build/base/half/stage/include -I/__w/TheRock/TheRock/build/third-party/sysdeps/linux/libdrm/build/stage/lib/rocm_sysdeps/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx942 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -Wno-undefined-func-template -Wno-float-equal --offload-compress -include /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp -MD -MT test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -MF test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o.d -o test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -x hip -c /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/test_gemm_streamk_simple.cpp
[composable_kernel] In file included from <built-in>:2:
[composable_kernel] In file included from /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp:9:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm.hpp:23:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1.hpp:7:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1_default_policy.hpp:8:
[composable_kernel] /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp:185:1: error: implicit instantiation of undefined template 'ck_tile::impl::warp_gemm_dispatcher::Dispatcher<_Float16, _Float16, float, 16, 16, 8, false, false, false, ck_tile::WGAttrNumAccessEnum::Single, ck_tile::WGAttrNumAccessEnum::Single>'
```
## Technical Details
### Changes Made:
#### 1. __gemm_streamk_validation_utils.py__
- Added module-level storage: `_configured_gpu_targets`
- Added `set_gpu_targets(targets: List[str])` to configure fallback GPU
targets
- Added `get_configured_gpu_targets() -> List[str]` to retrieve
configured targets
- Enhanced `get_gpu_name_by_id()` to:
- First try `rocminfo` (existing behavior)
- If `rocminfo` fails, fall back to first configured GPU target
- Extract base gfx name (e.g., "gfx90a" from "gfx90a:xnack+")
- Log debug messages when using fallback
#### 2. __gemm_streamk_instance_builder.py__
- Added `--gpu_targets` command-line argument
- Automatically calls `set_gpu_targets()` when `--gpu_targets` is
provided
- Parses semicolon-separated GPU target list from CMake
#### 3. __test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt__
- Modified both `--list_kernels` and `--gen_single` invocations to pass
`--gpu_targets "${SUPPORTED_GPU_TARGETS}"`
- GPU targets are now automatically wired from CMake to Python scripts
### How It Works:
1. __CMake Configuration__: `SUPPORTED_GPU_TARGETS` is determined from
`GPU_TARGETS` or defaults
2. __CMake → Python__: CMake passes targets via `--gpu_targets` argument
to Python scripts
3. __Python Configuration__: Scripts call `set_gpu_targets()` to
configure the fallback
4. __Fallback Mechanism__: When `rocminfo` fails, `get_gpu_name_by_id()`
uses the first configured target
5. __Target Parsing__: Extracts clean gfx name (e.g., "gfx90a" from
"gfx90a:xnack+")
## Test Plan
Confirm that only the appropriate kernels are selected and that CI
passes.
## Test Result
1. Waiting on CI
2. Compilation succeeded locally and the kernel list does not contain
the 16x16x8 kernel for gfx942 anymore:
```
(.venv) bhargrea@ctr-cx66-mi300x-02:~/github/TheRock$ cat build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_kernel_list.txt
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x16|256x256x32_2x2x1_16x16x16|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x16|256x256x32_2x2x1_16x16x16|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x32|256x256x32_2x2x1_16x16x32|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x32|256x256x32_2x2x1_16x16x32|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x8|256x256x32_2x2x1_32x32x8|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x8|256x256x32_2x2x1_32x32x8|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x16|256x256x32_2x2x1_32x32x16|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x16|256x256x32_2x2x1_32x32x16|compv3_cshuffle_intrawave_atomic_False_False_False_False
```
## Submission Checklist
- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] MX GEMM non-preshuffled RCR layout
## Motivation
Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Precompute SpaceFillingCurve indices to reduce compile
time by 31% (#5041)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
Optimize `SpaceFillingCurve` in CK to reduce compile time by
precomputing all index values into a static constexpr lookup table.
### Problem
- `GetIndex<N>` was instantiated separately for every index value (0 to
NumAccesses-1)
- Each instantiation triggered nested `static_for` loops with O(N²)
template depth
- This caused **34,000+ template instantiations** taking **69 seconds**
in frontend
### Solution
- Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed
indices
- Add `compute_single_index()` helper using O(N) `static_for` loops
- Add `compute_all_indices()` to build entire table in one constexpr
evaluation
- `GetIndex<N>` becomes simple array lookup: `return index_table[N]`
### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp)
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Total compile time | 120.4s | 83.6s | **-31%** |
| Frontend time | 88.7s | 52.6s | **-41%** |
| GetIndex instantiations | 34,176 | 384 | **-99%** |
| GetIndex time | 69.0s | 0.11s | **-99.8%** |
| SpaceFillingCurve time | 75.7s | 4.3s | **-94%** |
## Test plan
- [x] Builds successfully with `-Werror -Weverything`
- [ ] Run existing unit tests
- [ ] Verify numerical correctness on sample kernels
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[CK] Streamk tile engine test not setting a reasonable
CU_COUNT default when the query fails (#5165)
## Motivation
The following error was coming up when compiling on Windows when the
generate_configs.py file tries to query the GPU for the number of CU's:
```
[composable_kernel configure] -- Generating Stream-K test config files for fp16
[composable_kernel configure] Traceback (most recent call last):
[composable_kernel configure] File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 277, in <module>
[composable_kernel configure] main()
[composable_kernel configure] ~~~~^^
[composable_kernel configure] File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 271, in main
[composable_kernel configure] cu_count, configs_dir_path, tile_sizes, datatype = get_args()
[composable_kernel configure] ~~~~~~~~^^
[composable_kernel configure] File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 267, in get_args
[composable_kernel configure] return (int(args.cu_count), args.configs_dir_path, args.tiles, args.datatype)
[composable_kernel configure] ~~~^^^^^^^^^^^^^^^
[composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n'
[composable_kernel configure] CMake Error at test/ck_tile/gemm_streamk_tile_engine/generate_configs.cmake:98 (message):
[composable_kernel configure] Eror occured during execution of
[composable_kernel configure] E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/generate_configs.py
[composable_kernel configure] Call Stack (most recent call first):
[composable_kernel configure] test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt:301 (generate_test_configs)
[composable_kernel configure]
[composable_kernel configure]
[composable_kernel configure] -- Configuring incomplete, errors occurred!
[composable_kernel configure FAILED WITH CODE 1 in 41 seconds]
ninja: build stopped: subcommand failed.
```
## Technical Details
There was one major problem in the following code and two changes were
made:
```
execute_process(
COMMAND ${CPP_EXE_PATH}
OUTPUT_STRIP_TRAILING_WHITESPACE
ERROR_VARIABLE standard_error
RESULT_VARIABLE queried_cu_count
)
if (standard_error)
message(STATUS "Error information from attempting to query HIP device and properties:\n"
"${standard_error}")
endif()
```
1. RESULT_VARIABLE does not capture the IO output of the executable, but
rather the exit code. You can see from the error output here that it was
trying to cast "Exit code 0xc0000135\n" to an integer. I fixed this by
changing RESULT_VARIABLE to OUTPUT_VARIABLE.
```
[composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n'
```
Note that this also gives us the reason that the query failed: Exit code
0xc0000135, which needs to be addressed in a separate issue: "Exit code
0xc0000135, also seen as -1073741515, is a Windows error indicating that
an application failed to start because a required Dynamic Link Library
(DLL) file or a system component like the .NET Framework is missing or
corrupted"
It's likely the executable that is created from this code can't find the
hip dll, or something similar:
```
set(CPP_FILE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cu_count.cpp)
set(CPP_EXE_PATH ${CMAKE_CURRENT_BINARY_DIR}/cu_count)
execute_process(
COMMAND ${CMAKE_HIP_COMPILER} -x hip ${CPP_FILE_PATH} -o ${CPP_EXE_PATH}
RESULT_VARIABLE compile_result
)
```
2. For clarity and consistency purposes, I changed the check afterwards
to explicitly look for a non-zero exit code. This matches previous
checks in the cmake file. I also added improved error checking when the
query for the cu count fails.
## Test Plan
Ensure it compiles locally and existing CI isn't impacted.
## Test Result
Waiting on CI.
## Submission Checklist
- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_BUILDER] Clean up ConvDescription output formatting
(#5085)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The `ConvDescription::getDetailedDescription()` output had several
issues that made it harder to read and potentially misleading:
1. **Bug fix**: The LDS padding field was incorrectly displaying
`dst_scalar_per_vector_k1` instead of the actual `lds_padding` value
2. **Noise reduction**: Optional parameters that weren't set were
printing unhelpful messages like "Struct does not contain optional
gemm_padding argument" — these add clutter without providing value to
the reader
3. **Formatting inconsistencies**: Trailing spaces after colons (e.g.,
`"Warp Gemm parameters: "`) and a stray trailing `×` in tile dimensions
4. **Missing thread cluster lengths**: The threads per axis are not
shown.
**Changes**:
- **Fixed the LDS padding bug** by using
`traits_.a_tile_transfer.transfer_params.lds_padding` and
`traits_.b_tile_transfer.transfer_params.lds_padding` instead of
duplicating `dst_scalar_per_vector_k1`
- **Simplified optional parameter handling**: Changed from printing
"Struct does not contain..." messages to simply omitting absent optional
values. Also switched from `.value_or()` to direct dereference (`*`)
since we're already inside an `if` check
- **Cleaned up formatting**: Removed trailing spaces after colons and
the extra `×` at the end of tile dimension lists
- **Added missing thread cluster lengths**: Added X×Y×Z" display for
both A and B tile transfer sections.
- **Fixed typo**: "Do Padd Gemm" → "Do Pad Gemm"
- **Fixed typo**: "scr" → "src"
- **Fixed typo**: "tensros" → "tensors"
- `ninja smoke-builder` ✓
- `ninja check-builder` ✓
The test file updates reflect the corrected expected output, which now
shows the actual `lds_padding` values (0 or 1), shows thread cluster
lenths, and omits the verbose "Struct does not contain..." lines.
**Note**: This PR follows PR #5083.
[CK] Fix the issue of the aiter to call eightwarps pipeline.
(#5218)
## Motivation
Fix the failure of the aiter to call eightwarp.
Changed Async to the name eightwarps.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
Pass
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.