composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-28 02:37:01 +00:00

Author	SHA1	Message	Date
Enrico Degregori	2733e75900	[rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e) [CK Tile] Async support pipeline V3 ## Motivation Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async pipeline) ## Technical Details - Add `Async` bool to `Problem` struct to enable async pipeline in existing one - Add `static_move_ys` to load transpose. This generates offset in assembly instructions saving registers - Add `is_valid` to `async_get_vectorized_elements`. Before hard coded to true. It allows to support padding - Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr` (wider use of lds load transpose on gfx950) - Integrate async support in existing V3 pipeline (avoid pipelines duplication) - Create policy to support both async and default cases. This could be used by any async pipeline (next steps) - Define `wg_attr_num_access` separately for A and B. This allows to optimize ds_read instruction width for cases when one matrix is transposed and the other is not. Before in such cases, `ds_read_b64` was used instead of `ds_read_b128` - Add test for V3 async. Currently only supporting cases with A and B having the same type ## Test Plan New test `test_ck_tile_gemm_pipeline_compv3_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 06:57:14 +00:00
Sami Remes	a3a12b8945	[rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf) [CK_TILE] Enable full transpose layout support for MX GEMM pipeline (#5813) ## Enable full transpose layout support for MX GEMM pipeline (32x32x64 MFMA) ### Summary This PR enables all four matrix layout combinations (Row/Col, Row/Row, Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously, only the canonical `A=RowMajor, B=ColumnMajor` layout was supported. ### Changes Kernel-side transpose support: - `warp_gemm_attribute_mfma.hpp`: Introduce `kSplitFactor` logic in `get_warp_dstr_encoding` to split the K-dimension distribution encoding when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This satisfies the `TransposeTileDistributionTraits` suffix validation required by `load_tile_transpose`. The distribution encoding now also receives the `DataType` template parameter to compute the split factor based on packed element size. - `gemm_pipeline_ag_bg_cr_comp_async.hpp`: Uncomment and enable the `InputTileDistributionTraits` logic to properly transform LDS load tile distributions for transposed reads. Add `static_assert`s to catch misconfigurations where a layout requires transpose loads but the warp tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr` limits). - `load_tile_transpose.hpp`: Fix `DataVec` sizing for packed types (`pk_fp4_t`): divide `vecLoadSize` by `PackedSize` to prevent buffer overflow when each physical element contains multiple logical values. - `warp_gemm_attribute_mfma_impl.hpp`: Set `kDefaultScale` to `0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()` overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`. 
- `warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`: Add generic `WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher specialization to support arbitrary MX data type combinations (fp4, fp6, fp8) with the 32x32x64 MFMA, consolidating the existing type-specific aliases. - `gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`: Simplify `wg_attr_num_access` determination — `Double` for fp8, `Single` otherwise. Reference implementation fix: - `reference_gemm.hpp`: Fix nibble selection for packed 4-bit types (`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and `reference_gemm_abquant`. The previous logic used `k % 2` or `index[K_DIM] & 1` to select which nibble to extract, which assumed K was always the fast (contiguous) memory dimension. This is only true for `A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its parity to correctly select the nibble regardless of layout. Test infrastructure: - `test_mx_gemm_config.hpp`: Add `MxGemmConfig32` base and `MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64 warp tile. - `test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`: Add `Config32` test suites covering all four layout combinations. Restrict `Config16` (16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds `ds_read_tr` limits. - `test_mx_gemm_util.hpp`: Fix scale tensor layout — scales are always row-major `[M, K/32]` and column-major `[K/32, N]`, independent of A/B data layout. ### Test plan - [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)	2026-06-18 17:05:09 +00:00
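The nibble-selection fix described for `reference_gemm.hpp` in the entry above can be illustrated with a small host-side sketch. It is not CK code: `Layout2D`, `store_nibble`, and `load_nibble` are hypothetical names, and the convention that the element at an even flat offset sits in the low nibble is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2-D layout (not CK's tensor descriptor).
// Strides are counted in logical 4-bit elements.
struct Layout2D
{
    std::size_t stride_row;
    std::size_t stride_col;

    std::size_t flat_offset(std::size_t r, std::size_t c) const
    {
        return r * stride_row + c * stride_col;
    }
};

// Assumption: the element at an even flat offset occupies the low nibble.
inline void store_nibble(std::vector<std::uint8_t>& packed,
                         const Layout2D& layout,
                         std::size_t r,
                         std::size_t c,
                         std::uint8_t value)
{
    const std::size_t off = layout.flat_offset(r, c);
    std::uint8_t& byte    = packed[off / 2];
    if(off % 2 == 0)
        byte = static_cast<std::uint8_t>((byte & 0xF0) | (value & 0x0F));
    else
        byte = static_cast<std::uint8_t>((byte & 0x0F) | ((value & 0x0F) << 4));
}

// Select the nibble from the parity of the flat memory offset,
// not from the parity of the K index.
inline std::uint8_t load_nibble(const std::vector<std::uint8_t>& packed,
                                const Layout2D& layout,
                                std::size_t r,
                                std::size_t c)
{
    const std::size_t off   = layout.flat_offset(r, c);
    const std::uint8_t byte = packed[off / 2];
    return static_cast<std::uint8_t>((off % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
}
```

With a column-major 2x4 layout (`Layout2D{1, 2}`), element (0, 1) has an odd column index but an even flat offset, so a `k % 2` rule would pick the wrong nibble while the offset-parity rule does not.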
Enrico Degregori	1762eaeaec	[rocm-libraries] ROCm/rocm-libraries#8535 (commit a0f47eb) [CK Tile] EightWaves pipeline int8 support ## Motivation The EightWaves pipeline currently supports only FP types. ## Technical Details - Enable 16x16x64 int8 instruction for gfx950 in dispatcher - Enable int8 in EightWaves pipeline - Add tests - Fix bug in `warp_gemm_attribute_mfma_impl.hpp` ## Test Plan Tests have been added for int8 GEMM using the EightWaves pipeline. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 12:59:59 +00:00
Aviral Goel	c43b550206	[rocm-libraries] ROCm/rocm-libraries#8202 (commit 0911fa0) [GFX1250][CK_TILE] Add scale16 (ScaleBlockSize=16) support to MX GEMM TDM pipeline (#8202) Enables `ScaleBlockSize=16` end-to-end for the FP8/BF8 MX GEMM TDM pipeline, building on the scale16 warp-gemm layer already in develop. - warp gemm: add the 32x32x128 f8f6f4 scale16 traits and alias (2x2 grid of 16x16x128 scale16 intrinsic calls with per-subtile `SCALE_OPSEL`), and route 32x32 f8f6f4 through the dispatcher's `IsScale16` path. - default policy: select the warp gemm via the dispatcher with `IsScale16=(ScaleBlockSize==16)` so `WarpTile=16` and `WarpTile=32` each pick the matching scale16 path; guard WarpTile M/N to 16 or 32; scale-tile distribution for the scale16 layout. - pipeline V1/V2: thread `Problem::ScaleBlockSize` through the scale-window setup (replacing the hardcoded 32); expose `ScaleBlockSize` for the kernel. - block gemm: extract int64 (scale16) / int32 (scale32) scales by width. - kernel: scale16 descriptor order; reject unsupported `BlockScaleSize`. Test coverage for this path is in the stacked follow-up PR.	2026-06-17 16:41:00 +00:00
SamiAario-AMD	39182b50eb	[rocm-libraries] ROCm/rocm-libraries#8487 (commit 06a73ba) Skip tests on gfx11 that have intermittent failures ## Motivation On gfx11, skip the sporadically failing load_and_convert_tile cases where the X and Y types differ. Same-type tuples (half/half, bf16/bf16, fp8/fp8) have been stable. ## Test Result Stress-tested on gfx11, gfx12, and gfx950 with 10000 iterations of the tests. No remaining test failures were detected. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-17 11:07:22 +00:00
Andriy Roshchenko	b8440b3aeb	[rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6) [GFX1250][MX GEMM] Unified FLATMM GroupedGemm Implementation for MX Data Types (#8325) ## Motivation Design and test a unified FLATMM GroupedGemm interface so that it supports all MX FP8, FP6, and FP4 data types on both the gfx950 and gfx1250 architectures and works seamlessly across these platforms. ## Technical Details Implementation exposes Grouped Gemm interface for MX FLATMM and MX TDM FLATMM pipelines. ## Test Plan Add the following tests: - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_non_tdm.cpp - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_tdm.cpp - ck_tile/flatmm/test_mx_flatmm_persistent.cpp Verify on the gfx950 and gfx1250 architectures. ## Test Result All tests pass. Verified on A0 hardware with rocm-7.14.0a20260517 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 16:12:33 +00:00
Sami Remes	c1f7104852	[rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01) [CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K, unsupported-shape guard (#6663) Three independent MX GEMM correctness bugs reported against example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one host-side atomic-add accumulation bug in the example's repeat loop. - Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop required num_loop > PrefetchStages, which let num_loop == 3 enter a hot loop that produced 5 gemm accumulations instead of 3 (K == 3K_Tile, e.g. K=768, deterministically wrong). Require num_loop >= 4 instead: pre-pipeline + TailNumber::Three already totals exactly 3. - Kernel (gemm_mx_kernel.hpp): split-K was silently broken because GridSize did not thread k_batch into blockIdx.z and the scale tile windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore read the wrong packed scales. Fix: GridSize returns dim3(grid_x, 1, k_batch) (persistent and non-persistent). * MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a packed-scale K offset (also apply pad_tensor_view so OOB scale loads return zero, matching A/B padding). * operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset (matches Underlying::SplitKBatchOffset's K1-aligned formula), and dispatches the epilogue with memory_operation_enum::atomic_add for k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard as UniversalGemmKernel. * MakeCBlockWindows templated on DstInMemOp; unconditionally applies pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are handled correctly. - Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp): add IsSupportedArgument and a static_assert for configurations that produce silent wrong results: * static_assert(!kPadK) -- the MX comp-async pipeline uses async_load_tile whose OOB check is per-vector-start, so a vector straddling the K pad boundary reads garbage. 
Until the async path learns per-element pad masking, reject kPadK at compile time. * Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always (CShuffleEpilogue cannot safely run with a single partial tile); K % (KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of WarpTile_K * k_batch so every split lands on a packed-scale boundary. * All error paths log under CK_TILE_LOGGING with actionable messages. - Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp): * Call Kernel::IsSupportedArgument up front and throw a clear runtime_error for rejected shapes (was silently launching an unsupported kernel). * Switch to launch_kernel_time_mask with a clear_gemm_output preprocess that zeroes C between iterations when k_batch > 1 (mirrors universal_gemm_invoker). Without this the default -warmup=50 -repeat=100 accumulated 150 atomic_adds into C after the kernel-side split-K fix. Tests (test/ck_tile/gemm_mx/): - Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true). - test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression), SplitK (k_batch=2,4 across full_k/partial_k paths), TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding} covering trailing partial tiles along M, N, or both. - Run(...) now takes k_batch. - packScalesMNxK: guard against OOB (mn, k) reads from src and initialise e8m0 bytes to the zero exponent (0x00) instead of the default-constructed NaN (0xFF), so padded lanes don't poison the packed int32_t shared with in-range lanes. - test_mx_gemm_instance.hpp: call IsSupportedArgument before launch. Verification on gfx950, ROCm 7.2.0: - ctest -R test_ck_tile_mx_gemm -> 100% (2/2). - Example sweep over the original bug-report shapes: all K-aligned shapes now validate correct (including 4096^3 sk=2 and the K=768 cases); all K=128 shapes cleanly rejected with the new error message instead of producing silent wrong results. 
Made-with: Cursor ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 08:28:55 +00:00
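The host-side accumulation bug described in the entry above (atomic adds piling up in C across warmup and repeat iterations once split-K works) can be modeled on the CPU. This is a minimal sketch under stated assumptions, not the example's code: `splitk_launch` and `run_repeated` are hypothetical names, a single output value stands in for C, and plain `+=` stands in for the kernel's `atomic_add` epilogue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of one split-K launch: each k_id adds its partial dot product into c.
// Assumes a.size() == b.size() and that the size is divisible by k_batch.
inline void splitk_launch(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::size_t k_batch,
                          float& c)
{
    const std::size_t k_per_split = a.size() / k_batch;
    for(std::size_t k_id = 0; k_id < k_batch; ++k_id)
    {
        float partial = 0.f;
        for(std::size_t k = k_id * k_per_split; k < (k_id + 1) * k_per_split; ++k)
            partial += a[k] * b[k];
        c += partial; // stands in for memory_operation_enum::atomic_add
    }
}

// Model of a benchmark loop. With clear_between == false the result of every
// iteration is accumulated on top of the previous one.
inline float run_repeated(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::size_t k_batch,
                          int iterations,
                          bool clear_between)
{
    float c = 0.f;
    for(int i = 0; i < iterations; ++i)
    {
        if(clear_between)
            c = 0.f; // plays the role of the clear_gemm_output preprocess
        splitk_launch(a, b, k_batch, c);
    }
    return c;
}
```

For `a = {1, 2, 3, 4}` and `b = {1, 1, 1, 1}` the correct result is 10; three iterations without clearing yield 30, which mirrors the 150 accumulated adds the entry mentions for `-warmup=50 -repeat=100`.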
SamiAario-AMD	947dcc2606	[rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c) [CK Tile] Add transposed tile load implementation, and tests for load_and_convert_tile (#5510) ## Motivation Mixed precision b/fp16 x fp8 requires a transposed tile load implementation that supports mixed precision using these types. Implement this, use it in `load_and_convert_tile`, and add a unit test for `load_and_convert_tile` which covers this functionality. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 06:42:28 +00:00
John Afaganis	329e589840	[rocm-libraries] ROCm/rocm-libraries#8260 (commit 1139236) [ck] Enforce LF-only line endings in C/C++ sources ## Summary Several CK source files carry Windows CRLF line endings (a trailing carriage return on each line), introduced by editors configured for Windows endings or copy/paste from Windows tooling. These are purely cosmetic but they pollute diffs (whole-file churn the first time someone makes an LF edit), confuse `clang-format`, and are inconsistent with the LF-only convention used across the rest of the tree. This PR (a) normalizes every existing CRLF file (6 files) to LF and (b) adds a pre-checkin gate so new CRLF leaks are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate as the adjacent `ASCII Only Check` stage: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `[ck] Normalize CRLF line endings to LF in C/C++ sources` Mechanical line-ending cleanup across 6 files. No content change: every edit is purely CRLF -> LF, verified with `git diff --ignore-cr-at-eol` reporting an empty diff. 2. `[ck] Enforce LF-only line endings in C/C++ sources` - New `projects/composablekernel/script/check_no_crlf.sh` (modeled on `check_ascii_only.sh`). - New `crlf-checker` entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). 
- New `CRLF Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the adjacent `ASCII Only Check` stage. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 0 CRLF violations; commit 2 wires the gate. ## Demo Script output on a synthesized violation:
```
$ printf 'int main() {}\r\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_no_crlf.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains CRLF (Windows) line endings:
1:int main() {}<CR>
Fix: convert to LF, e.g. 'sed -i 's/\r$//' /tmp/bad.cpp' or 'dos2unix /tmp/bad.cpp'
$ echo $?
1
```
Full repo scan after the cleanup commit:
```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
    | xargs -0 -P 8 -n 64 script/check_no_crlf.sh
$ echo $?
0
```
## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> CRLF Check` stage runs green over the full predicate and the existing `ASCII Only Check` / `Clang Format` stages are unaffected. - [ ] Local: `pre-commit run crlf-checker --all-files` runs cleanly after installing CK pre-commit hooks. - [ ] Manually inject a CRLF line ending in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-12 21:11:59 +00:00
Wojciech Laskowski	c2601f38b7	[rocm-libraries] ROCm/rocm-libraries#6569 (commit 393049e) Adding amdgcn_mma specializations for sparse MFMA builtins (#6569) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It is the fourth in the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801, https://github.com/ROCm/rocm-libraries/pull/6014 and https://github.com/ROCm/rocm-libraries/pull/6567) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA sparse builtins. In total, we add 27 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Tests pass locally; waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 12:48:29 +00:00
jefyang1	276863ca87	[rocm-libraries] ROCm/rocm-libraries#8259 (commit df03f10) Add cluster launch in test ck_tile mx gemm tdm wmma ## Motivation Add a cluster launch test to test_ck_tile_mx_gemm_pipeline_tdm_wmma on gfx1250, so that we can check the performance on gfx1250 hardware. ## Technical Details Added an out-of-bounds guard in RunGemm of MxGemmKernel to skip blocks padded by cluster alignment. Added ClusterEnable/ClusterDisable aliases and extended the tuple in test_mx_gemm_pipeline_kernel_types.hpp with two kernel types that use ClusterEnable, for F8 CompTDMV1 and CompTDMV2 respectively. The existing F4 non-ClusterLaunch kernel types have an issue that still needs to be fixed, so this PR does not include F4 cases. Read ClusterLaunch from the tuple in test_mx_gemm_pipeline_util.hpp. Updated invoke_mx_gemm to branch on ClusterLaunch: add cluster size constants and switch the GemmShape type, the TilePartitioner type, and the kernel launch call. ## Test Plan Tested the changes on gfx1250 FFM. ## Test Result The added kernel types (instances) passed the tests on gfx1250 FFM. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 17:33:11 +00:00
Illia Silin	a433424e08	[rocm-libraries] ROCm/rocm-libraries#8241 (commit cd183df) [CK] increase time limit for fmha_bwd tests to prevent timeouts (#8241) ## Motivation Observed a CI failure due to fmha_bwd test timeout which never happened before. Going to increase the time limit for the test to prevent any further CI failures. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-10 15:37:44 +00:00
Chao	320a813d67	[rocm-libraries] ROCm/rocm-libraries#6533 (commit 5dcaa45) [CK_TILE] Add host-side Pack-GQA optimization for FMHA forward (#6533) ## Motivation Host-side Pack-GQA optimization for CK-Tile FMHA forward. Reshapes Q tensor from `[b, nhead_q, seqlen_q, d]` to `[b, nhead_kv, nhead_ratio * seqlen_q, d]` by adjusting strides, so grouped Q-heads sharing the same KV data are processed in a single tile. Zero kernel changes: runner-only. Phase 1: non-causal attention with GQA ratio packing. Phase 2: extends to dropout and split-kv paths, fixes stride edge cases. ## Technical Details Modified files (2): - `example/ck_tile/01_fmha/example_fmha_fwd.cpp`: Pack-GQA flag plumbing - `example/ck_tile/01_fmha/fmha_fwd_runner.hpp`: Q tensor reshape logic, stride adjustment for GQA ratio packing New files (1): - `example/ck_tile/01_fmha/test_pack_gqa_phase2.sh`: 53 test cases covering non-causal, dropout, split-kv, various GQA ratios ## Dependencies None: this PR is standalone. 
## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0), 53 parameterized test cases: - Command (GQA 4:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (GQA 8:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=64 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (decode): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` ## Test Result Benchmark results (MI350X, gfx950, ROCm 7.0):
| Config | Without Pack | With Pack | Improvement |
|--------|-------------|-----------|-------------|
| GQA 4:1 prefill b=2 h=32 hk=8 s=2048 d=128 bf16 | 690.05 TFlops (0.199 ms) | 695.61 TFlops (0.198 ms) | +0.8% |
| GQA 8:1 prefill b=2 h=64 hk=8 s=2048 d=128 bf16 | 706.25 TFlops (0.389 ms) | 729.35 TFlops (0.377 ms) | +3.3% |
| GQA 8:1 decode b=64 h=32 hk=4 s_k=4096 d=128 bf16 | 305.20 GB/s (1.763 ms) | 1813.41 GB/s (0.297 ms) | +5.9x |
| LLaMA-70B decode b=32 h=64 hk=8 s_k=4096 d=128 bf16 | 591.70 GB/s (0.909 ms) | 1820.65 GB/s (0.295 ms) | +3.1x |
| MHA ratio=1 b=2 h=8 s=4096 d=128 bf16 | 695.16 TFlops | 702.72 TFlops | no regression |
Benchmark results (MI300X, gfx942, ROCm 6.4.1): No regression on MI300X. Pack-GQA is a runner-only optimization (zero kernel changes), performance impact is within noise on MI300X. 
| Config | TFlops / GB/s | Time (ms) | Delta vs baseline |
|--------|-------------|-----------|-------------------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 336.52 TFlops | 0.408 | -1.7% |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 322.52 TFlops | 0.426 | -0.7% |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 349.85 TFlops | 0.786 | +0.5% |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 381.29 TFlops | 1.442 | +1.2% |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 697.32 GB/s | 1.541 | +0.8% |
All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - 53 parameterized test cases pass (23 phase 1 + 30 phase 2) - GQA ratios tested: 1:1, 2:1, 4:1, 8:1, 32:1 - No regression on MHA (ratio=1) workloads - fp16 and bf16 validated	2026-06-10 01:56:44 +00:00
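The stride-only reshape described in the Pack-GQA entry above can be checked with a short index calculation. This sketch assumes a contiguous `[b, nhead_q, seqlen_q, d]` Q tensor and a packing order in which the Q-heads of one KV group are laid out one after another along the packed sequence axis; the runner's actual ordering is not shown in the log, and `QShape`, `q_offset`, and `packed_offset` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Shape of a contiguous Q tensor [b, nhead_q, seqlen_q, d].
struct QShape
{
    std::size_t b, nhead_q, seqlen_q, d;
};

// Element offset in the original [b, nhead_q, seqlen_q, d] view.
inline std::size_t q_offset(const QShape& q,
                            std::size_t ib,
                            std::size_t ih,
                            std::size_t is,
                            std::size_t id)
{
    return ((ib * q.nhead_q + ih) * q.seqlen_q + is) * q.d + id;
}

// Element offset in the packed [b, nhead_kv, nhead_ratio * seqlen_q, d] view.
// Only the strides change; no data is moved.
inline std::size_t packed_offset(const QShape& q,
                                 std::size_t nhead_kv,
                                 std::size_t ib,
                                 std::size_t ihkv,
                                 std::size_t ip,
                                 std::size_t id)
{
    const std::size_t ratio        = q.nhead_q / nhead_kv; // assumes exact division
    const std::size_t stride_row   = q.d;
    const std::size_t stride_head  = ratio * q.seqlen_q * q.d;
    const std::size_t stride_batch = q.nhead_q * q.seqlen_q * q.d;
    return ib * stride_batch + ihkv * stride_head + ip * stride_row + id;
}
```

Under these assumptions, Q-head `ih = ihkv * ratio + r` at sequence position `is` maps to packed row `ip = r * seqlen_q + is`, and both views address the same element.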
Aviral Goel	93e0d79103	[rocm-libraries] ROCm/rocm-libraries#8035 (commit 45186b8) [CK_Tile] Add wmma_bf16f32_16x16x32_bf16 warp-gemm test (#8035) ## Summary Adds the warp-gemm unit test for `wmma_bf16f32_16x16x32_bf16`. Stacked on #8028 (the API change) and based on its branch, so #8028 shows the isolated API diff and this PR shows just the test. ## Test gfx125-guarded `WmmaBf16f32.ResidualPrecisionContrast`: computes `Y_bf16 = X_bf16·W_bf16 + R_fp32` via `WarpGemm::mac_downconvert`, compares against an fp32 reference (within bf16 tolerance), and asserts it is at least as accurate as the bf16-accumulate path, i.e. it demonstrates the precision benefit of the fp32 accumulator (`C`) carried into the fused bf16 down-convert. Passes on gfx1250.	2026-06-09 15:52:45 +00:00
Emily Martins	97ca00e449	[rocm-libraries] ROCm/rocm-libraries#7836 (commit cdd9958) [CK Tile] Stream-K RDNA Support ## Motivation Currently, CK Tile Stream-K only supports CDNA architectures. This change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures. ## Technical Details Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear, and 3) tree. The linear and tree reductions require inter-workgroup communication to a global flags buffer and a global partials buffer. To ensure cache coherency, we use cache modifiers to skip cache levels that are not visible to all workgroups. On CDNA architectures, scalar load and scalar store instructions are available, which we use to read and write to the flags buffer with appropriate cache skipping modifiers. However, RDNA architectures do not support scalar store instructions, so workgroups must use a buffer store instruction to write to flags. Additionally, cache modifiers differ between CDNA and RDNA; they also differ between RDNA3 and RDNA4. Given this information, the main changes are as follows: - Added RDNA flag signaling: Use buffer store instructions for writing to global flags buffer - Add appropriate cache modifiers for reading and writing to flags and partials: - RDNA3 (gfx11): Use `glc | dlc` coherence flags - RDNA4 (gfx12): Use `DEVICE` coherence scope - SFINAE-guarded overloads: Added compile-time dispatch for `SignalStorePartialDone()` and `WaitStorePartialDone()` based on target architecture - RDNA alignment requirements: Increased flags buffer alignment from 128B to 256B due to RDNA cache line size A note about the `amd_buffer_coherence_enum`: - Problem: The `amd_buffer_coherence_enum` uses preprocessor conditionals (`#if defined(__gfx12__)`) to define architecture-specific values. Template specializations reference enum values from different architectures (e.g., `glc_dlc` for GFX11). 
Due to C++ two-phase name lookup, non-dependent names are resolved during template parsing regardless of which architecture is being compiled, causing compilation failures when referenced values do not exist in the active preprocessor branch. - Temporary Solution: Added compatibility enum values to each architecture block. For example, I added `glc_dlc` in the `__gfx12__` block. I will create a ticket to refactor this enum with a design that has better scalability and tries to avoid the use of preprocessor conditionals. ## Test Plan ### Summary gtests were added to test WMMA variants of Stream-K. These tests were stress-tested locally on gfx11 and gfx12. ### More details This PR makes the following changes/additions to the Stream-K gtests: - Split tests into MFMA (CDNA) and WMMA (RDNA) variants - Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction - WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size supported on RDNA) - Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU count - As described in [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898), when in Workgroup Processor (WGP) mode, the value of `hipDeviceAttributeMultiprocessorCount` is half the number of CUs, because a single WGP contains two CUs. The default mode on RDNA is WGP mode, so when creating (M, N, K) instances for gtests using the CU count, we need to multiply the reported count by 2 to get the correct value. This is not needed in the kernel host code, because the occupancy ensures that overall `max_active_wgs` is correct. ## Test Result All tests pass locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-08 22:48:10 +00:00
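Two pieces of arithmetic from the Stream-K RDNA entry above are easy to get wrong: the CU count in WGP mode and the 256-byte flags buffer alignment. The sketch below shows both as plain host functions. `compute_unit_count` and `align_up` are hypothetical helpers, and the WGP mode flag is passed in as a parameter as an assumption; in real code the multiprocessor count would come from the HIP attribute `hipDeviceAttributeMultiprocessorCount` named in the entry.

```cpp
#include <cassert>
#include <cstddef>

// In WGP mode the reported multiprocessor count is the number of WGPs,
// and each WGP contains two CUs.
inline int compute_unit_count(int multiprocessor_count, bool wgp_mode)
{
    return wgp_mode ? 2 * multiprocessor_count : multiprocessor_count;
}

// Round size up to the next multiple of alignment (alignment must be > 0).
inline std::size_t align_up(std::size_t size, std::size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}
```

For example, a device reporting 24 multiprocessors in WGP mode has 48 CUs, and a 130-byte flags buffer padded to the 256-byte alignment occupies 256 bytes.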
Enrico Degregori	1b4fbd95fd	[rocm-libraries] ROCm/rocm-libraries#6089 (commit c876d18) [CK Tile] Extend type support EightWave pipeline ## Motivation The EightWave pipeline was designed for 8-bit types. This PR extends support to any FP type. ## Technical Details - Generalize policy to support any FP type - Change LDS layout to fix bank conflicts. This removes all bank conflicts in the pipeline (checked for all supported types). Remaining bank conflicts are related to the CShuffle epilogue. ## Test Plan Added GEMM tests with the newly supported types. Note that FP6 is also supported for MX GEMM, but that PR was reverted, so no tests were added for it. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 23:54:40 +00:00
Yung-sheng Tu	e826b2eb7e	[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f) [CK TILE] Unification Work – Add MFMA specialisations for `tf32_t` (#6768) ## Motivation This PR adds two specialisations related to `tf32_t`. ## Technical Details This change treats `tf32_t` as a concrete type rather than an empty `struct`. It also adds two new specialisations for MFMA dense builtins and resolves existing circular include issues. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 12:27:41 +00:00
Sami Remes	ad4e2e7624	[rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320) [CK_TILE] [QuantGEMM] Fix SplitK tail handling and other improvements (#7199) This pull request introduces improved and more robust split-K support for quantized GEMM. The main changes add runtime validation, utility functions for split-K batch calculations, pointer offset handling for split-K in grouped kernels, and enhanced support for various tensor layouts. The changes also improve error handling and provide more flexibility for runtime tail handling in split-K pipelines. Split-K Support and Validation Enhancements: * Added runtime validation to ensure `k_batch` is a positive integer and that split-K configurations do not produce empty final batches or mismatched pipeline tails, with detailed error messages and logging for misconfiguration. * Introduced utility functions `get_splitk_batch_k_read` and `get_splitk_last_batch_k` to compute per-batch K read sizes and handle split rounding, ensuring correct and consistent split-K batch partitioning. * Changed the default value of `k_batch` in `QuantGemmHostArgs` to 1 (no split-K) for safer default behavior. Pointer Offsets and Grouped Kernel Handling: * Updated `QuantGroupedGemmKernel` to apply split-K per-batch offsets to all input pointers, mirroring the behavior of non-grouped kernels and ensuring correctness for split-K launches. 
* Modified AQ tensor view handling to correctly reflect the remaining K-groups from the split-K batch's offset position, improving accuracy for split-K in grouped kernels. Pipeline and Layout Flexibility: * Added support for runtime selection of split-K tail handling via a new template parameter `RuntimeSplitKTail_`, with new helper methods to dispatch GEMM pipelines accordingly. [[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R273) [[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1496-R1567) [[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1427) [[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1447-R1629) [[5]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1459-R1641) * Improved handling for tensor layout cases, including preshuffled B and both row-major and column-major AQ layouts, ensuring correct pointer arithmetic and compatibility checks. [[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R438-R454) [[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L464-R516) [[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)	2026-06-05 11:41:49 +00:00
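The split-K validation described in the commit above (positive `k_batch`, no empty final batch) can be sketched as follows. This is only an illustration: `SplitKPlan` and `plan_split_k` are hypothetical names, not the CK helpers `get_splitk_batch_k_read` / `get_splitk_last_batch_k`, and rounding each batch's read size up to a multiple of `KPerBlock` is an assumption about how the partitioning works.

```cpp
// Hypothetical sketch of split-K batch partitioning; not the CK implementation.
struct SplitKPlan
{
    long k_read;       // K elements read by every batch except the last (assumed multiple of KPerBlock)
    long last_batch_k; // K elements left over for the final batch
    bool valid;        // false when inputs are non-positive or the final batch would be empty
};

constexpr SplitKPlan plan_split_k(long K, long k_batch, long k_per_block)
{
    if (K <= 0 || k_batch <= 0 || k_per_block <= 0)
        return SplitKPlan{0, 0, false};

    // Round K up to a multiple of (k_batch * k_per_block), then divide among batches.
    const long k_unit = k_batch * k_per_block;
    const long k_read = (K + k_unit - 1) / k_unit * k_per_block;

    // The final batch receives whatever the earlier batches did not cover.
    const long last = K - k_read * (k_batch - 1);
    return SplitKPlan{k_read, last, last > 0};
}
```

Under these assumptions, `plan_split_k(1000, 4, 64)` gives three batches of 256 and a final batch of 232, while `plan_split_k(256, 8, 64)` is flagged invalid because the first seven batches would already over-cover K.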
Enrico Degregori	7b9245f18c	[rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d) [CK Tile] Async support preshuffle GEMM ## Motivation Add async support to the existing preshuffle GEMM pipeline. ## Technical Details Notes: the implementation avoids the previous strategy of duplicating pipelines for async support and instead adds a switch, `Async`, to the ops Problem to enable the async pipeline, which is then integrated into the existing one. This avoids code duplication and makes it easier to integrate buffer load to LDS into existing pipelines. In my opinion, the same approach should also be used for other pipelines that don't support buffer load to LDS yet, and it would be a good idea to refactor the existing async GEMM pipelines the same way. Summary: - Integrate buffer load to LDS in the existing pipeline - Add optimal tensor descriptors for vmem loading and LDS reading. They are currently optimized for 16x16 wave tiles but also work for 32x32 wave tiles. Optimizing for 32x32 wave tiles requires a different LDS layout and will be done in a follow-up issue - Add async config to examples - Add test (gfx950 only) ## Test Plan New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 07:17:09 +00:00
Enrico Degregori	bdd7a8333d	[rocm-libraries] ROCm/rocm-libraries#6672 (commit bda3f97) [CK Tile] PermuteN support MX GEMM ## Motivation Add PermuteN support to preshuffle MX GEMM ## Technical Details - Modify `shuffle_b_permuteN` to support MX preshuffled layout - Add `preShuffleScalePermuteN` with same functionality of `preShuffleScale` but layout consistent with PermuteN - Include MX pre-processing functions in the library ## Test Plan Add test configuration for permuteN with preshuffle (both FP4 and FP8) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Cong Ma <congma13@amd.com>	2026-06-05 03:04:43 +00:00
John Afaganis	96c39b331e	[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da) [ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility (#7829) ## Summary CK source files must be compilable via hipRTC (HIP runtime compilation), whose preprocessor does not accept non-ASCII bytes anywhere in a translation unit, including in comments. Bytes that are harmless under `hipcc` (em-dashes, smart quotes, multiplication signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at preprocessing time. These regularly leak in via LLM-assisted authoring or copy/paste from formatted documents and silently break hipRTC paths that are not exercised by the default `hipcc`-based build matrix. This PR (a) cleans every existing violation (53 files) and (b) adds a pre-checkin gate so new violations are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). This is a strict superset of the existing `Clang Format` stage's predicate: `.inc` is added so test-fixture include files are also gated. The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage in this PR; the local hook is a convenience. ## Commit layout (bisect-friendly) 1. `79798aa6261`: `[ck] Convert reflect/ rendering to ASCII for hipRTC compatibility` Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|- / +- / | ` (3-col width preserved so alignment is unchanged). 
`conv_description.hpp` swaps `×` for `x` as the dimension separator. `test_conv_description.cpp` expected strings updated in lockstep so the snapshot test stays green. This is the only commit in the series with observable runtime impact. 2. `738fdb0d81c` — `[ck] Strip non-ASCII bytes from C++ sources for hipRTC compatibility` Mechanical text cleanup across 53 files. Replacements happen in comments or in `std::cout` strings that are not asserted on by any test. None of the 174 `.inc` files in the tree required edits, but they were in the scan's predicate so the enforcement stage's predicate is a superset of what was scanned. Full replacement table in the commit message. 3. `1d7cd8ba235` — `[ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility` - New `projects/composablekernel/script/check_ascii_only.sh` (modeled on `check_copyright_year.sh`). - New entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). - New `ASCII Only Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the existing `Clang Format` stage but with `.inc` added to the find predicate. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 50 known violations; commit 2 leaves 0; commit 3 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains non-ASCII bytes: 1:// em-dash test — here Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.) $ echo $? 1 ``` Full repo scan after the cleanup commits (note the `-name '.inc'` clause): ``` $ cd projects/composablekernel && find . 
-type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \ -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \ -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \ | xargs -0 -P 8 -n 64 script/check_ascii_only.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check` stage runs green over the full predicate (incl. `*.inc`) and existing `Clang Format` stage is unaffected. - [ ] `test_conv_description` passes against the ASCII tree-formatter output (touched in commit 1). - [ ] Local: `pre-commit run ascii-only-checker --all-files` runs cleanly after installing CK pre-commit hooks via `script/install_precommit.sh`. - [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. - [ ] Spot-check a representative subset of touched files under hipRTC compilation to confirm no remaining hipRTC-blocking content (optional, since the static byte check is a sufficient condition for hipRTC preprocessor acceptance on this dimension). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-04 15:00:17 +00:00
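The byte-level rule enforced by the commit above (no byte outside 7-bit ASCII anywhere in a source file) can be sketched in C++ as below. The real gate is the shell script `check_ascii_only.sh`; the function names here are hypothetical and only illustrate the predicate, not how the script is implemented.

```cpp
#include <cstddef>

// Hypothetical sketch: returns the offset of the first byte above 0x7F,
// or `size` when the whole buffer is plain ASCII.
constexpr std::size_t first_non_ascii(const char* data, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
    {
        // Cast to unsigned char so the comparison does not depend on char signedness.
        if (static_cast<unsigned char>(data[i]) > 0x7F)
            return i;
    }
    return size;
}

// True when no byte in the buffer has its high bit set.
constexpr bool is_ascii_only(const char* data, std::size_t size)
{
    return first_non_ascii(data, size) == size;
}
```

A caller would read each source file into a buffer and report the offending offset when `is_ascii_only` returns false. For example, the UTF-8 em-dash bytes `\xe2\x80\x94` from the demo are detected at the position of the `\xe2` byte.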
Aviral Goel	e01603bc31	[rocm-libraries] ROCm/rocm-libraries#7725 (commit eef7e12) [GFX1250][CK_TILE] Add scale16 warp gemm unit tests ## Summary - Add scale16 WMMA intrinsic overloads and int64_t forwarding to warp gemm layers for gfx1250 - Add comprehensive wave-level unit tests for scale16 warp gemm (16x16x128 and 32x32x128 tile sizes) - Test all fp8/bf8 type combinations and TransposeC variants - Fix WarpGemm wrapper for non-uniform scale16 configurations Stacked on #7724 (FillUniformScaleDistribution / MX GEMM scale init). Pipeline enablement follows in the next PR.	2026-06-03 22:05:29 +00:00
chris-tsiaousis-hpc	db05d61136	[rocm-libraries] ROCm/rocm-libraries#6212 (commit ccee58d) [CK TILE] Unification Work – More accurate tests for MmaPipelines (#6212) ## Motivation This PR solves several issues: #### More accurate tests for MmaPipelines The current tests for the MmaPipelines (test_amdgcn_sparse_mma, test_amdgcn_wavewise_mma) use explicit input fragment vectors filled with 1s, and only check the output of a single lane. We should have tests that actually use the MmaPipelines with non-trivial input matrices and verify the complete output. Some other aspects of the current MmaPipelines tests that I noticed and deserve some attention: 1. There is sometimes iteration over K outside of the pipeline, which is then included in WaveTileK or FragK, which is not correct. We should remove it, move K iteration inside of the pipeline, or be more clear about this outer-K loop size and how it propagates downwards. 2. There is very tight coupling between the kernel, gtest code, and test_pipeline helper, requiring a lot of information and functions to be passed back and forth. 3. The test_pipeline helper is doing a bunch of register-related logic on the host (related to point 1) 4. Without this register logic the only thing it does is check the device, call the kernel, and check the output, but with a lot of boilerplate. #### Test helper for detecting target arch at HOST runtime There is a really apparent issue we faced while writing tests: Scenario: 1. Compile a test that supports both gfx950 and gfx1201 for gfx950 2. Run the test on a server that only has gfx1201 GPU Actual: Segmentation fault Expected: The test can correctly detect from HOST runtime that the DEVICE target_id was different and skips the test. 
Notes: The only way of detecting the COMPILER_TARGET_ID in the existing "arch" framework is launching a kernel and calling `get_compiler_target()` (so, from a DEVICE code). This will create a segmentation fault if the current arch differs from the target arch. To cope with this issue, we propose to export the compiler target(s) (note they can be many) through `projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and define a test helper to deal with such cases. #### Add composition support to Transforms We have a small number of Transforms which act on MmaOp input and output data, before and after the MmaOp call respectively. These are currently implemented to work on an MmaTile level, but in theory they are also supposed to work at a WaveTile level, i.e. after composition of multiple MmaTiles to create larger effective MNK dimensions. Currently the composed MmaTiles look like 2D C-style arrays of the individual MmaTile level register vectors (see WaveWiseMmaPipeline). The transforms should be able to take these and perform the proper transforms to the whole WaveTile at once. This might allow for better performing transformations. Note: This PR handles the SparseTransform case and if we don't end up doing scale as a transformation, there isn't really much left to do. If we end up having only the sparse transform as a non-trivial transform, then we could also consider removing the Transform framework.	2026-06-03 14:35:18 +00:00
Ville Pietilä	88f8d24c34	[rocm-libraries] ROCm/rocm-libraries#7936 (commit 3dc91e6) [CK Tile] Fix V6 pipeline applicability and split-image initialization (#7936) ## Motivation After adding code generation via CK Tile Dispatcher, some fwd and bwd weight tests for CK Tile convolutions are failing. This PR introduces correct applicability checks and fixes the split-image parameter initialization, so that non-applicable instances are not invoked during test execution and split-image instances are correctly initialized. ## Technical Details Investigation revealed two distinct problems: 1. For bwd weight, the compute V3 uses prefetch of 3 distinct tiles, which works incorrectly when the number of K-slices addressed by the workgroup is 1. This occurs when a large split-K value is used for a problem that results in a small Gemm-K value. 2. For fwd direction, the current CK Profiler/test infrastructure doesn't initialize the split-image parameters for instances where split-image is enabled. Uninitialized split-image values result in non-deterministic behavior where the tests might randomly fail. Fixed problem 1 by adding a check in `IsSupportedArgument` that marks the instance invalid if `num_loops = ceil(GemmK / (k_batch * KPerBlock)) < 4` for V6 pipeline kernel instances. The check is compile-time eliminated for other kernels. Fixed problem 2 by adding initialization of split-image parameters when split-image is enabled. The default initialization corresponds to the full image with no split, i.e., the number of splits is 1 and it has the size of the full image. Added unit tests for the added logic. ## Test Plan Running the following test suites covers the logic added in this PR - test_grouped_convnd_fwd_tile - test_ck_tile_grouped_conv_fwd - test_grouped_convnd_bwd_weight_tile - test_ck_tile_grouped_conv_bwd_weight All test suites above are included in the automated test runs. 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 08:40:03 +00:00
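The applicability rule quoted in the commit above, `num_loops = ceil(GemmK / (k_batch * KPerBlock)) < 4` marks a V6 instance invalid, can be written out as a small sketch. The function names and the `min_loops` parameter are hypothetical; in CK the check lives inside `IsSupportedArgument`, whose actual code is not shown in this log.

```cpp
// Hypothetical sketch of the V6 applicability rule; not the CK source.
constexpr long ceil_div(long a, long b)
{
    return (a + b - 1) / b;
}

// Number of K-loop iterations one workgroup executes for a given split-K factor.
constexpr long num_k_loops(long gemm_k, long k_batch, long k_per_block)
{
    return ceil_div(gemm_k, k_batch * k_per_block);
}

// An instance is applicable only when there are enough loops to fill the prefetch stages.
constexpr bool v6_instance_applicable(long gemm_k, long k_batch, long k_per_block,
                                      long min_loops = 4)
{
    return k_batch > 0 && k_per_block > 0 &&
           num_k_loops(gemm_k, k_batch, k_per_block) >= min_loops;
}
```

For instance, with `GemmK = 4096` and `KPerBlock = 64`, `k_batch = 1` gives 64 loops and passes, while `k_batch = 32` gives only 2 loops and is rejected. This is the "large split-K value, small Gemm-K" situation the commit describes.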
Anton Gorenko	7ecbf82708	[rocm-libraries] ROCm/rocm-libraries#7500 (commit f5cd4fd) [CK_TILE][FMHA] Optimize long-context decoding on gfx11/12 (#7500) ## Motivation Relevant issue: ROCM-22065 FMHA has less-than-optimal performance of long-context decoding (i.e. when seqlen_q = 1) on gfx11/12. This PR optimizes the splitkv pipeline and configs for such scenarios. ## Technical Details Optimizations applied in this PR: 1. use tiles with smaller M0 (16 vs 64), these tiles are used when seqlen_q <= 16 2. adapt qr_nwarp_sshuffle pipeline for gfx11, which allows more warps to be used even for M0 = 16 (the qr pipeline parallelizes work between warps in the M dim, so with M0 = 16 it can use only 1 warp) 3. enable kMergeNumHeadGroupsSeqLenQ (an optimization that merges one group of heads in GQA) for all hdim values, not only 128 4. increase the number of splits (multiply by the number of head groups) if (3) is used 5. increase the number of splits for RDNAs (`multiProcessorCount` is the number of WGPs on RDNAs, not CUs, so it should be doubled to have meaning similar to CDNAs) Performance on gfx1151:

| Case | develop (GB/s) | This PR (GB/s) |
|:-------|-------:|-------:|
| [fp16\|group\|bshd] b:1, h:32/32, s:1/45056, d:64/64 | 127.58 | 183.11 |
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/45056, d:64/64 | 153.64 | 215.02 |
| [fp16\|group\|bshd] b:1, h:16/8, s:1/77184, d:128/128 | 120.51 | 225.76 |
| [fp16\|group\|bhsd] b:1, h:16/8, s:1/77184, d:128/128 | 130.62 | 223.84 |
| [fp16\|group\|bshd] b:1, h:32/32, s:1/9600, d:128/128 | 82.65 | 138.44 |
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/9600, d:128/128 | 105.75 | 220.45 |
| [fp16\|group\|bshd] b:1, h:8/1, s:1/401024, d:256/256 | 16.27 | 187.89 |
| [fp16\|group\|bhsd] b:1, h:8/1, s:1/401024, d:256/256 | 16.28 | 188.19 |

## Test Plan An additional test case is added to the existing test. 
It uses seqlen_q = 1, GQA, no mask to trigger the changes ``` ninja test_ck_tile_fmha_fwd_fp16 && bin/test_ck_tile_fmha_fwd_fp16 --gtest_filter="SplitKV ninja test_ck_tile_fmha_fwd_bf16 && bin/test_ck_tile_fmha_fwd_bf16 --gtest_filter="SplitKV ``` Manual testing can be done with these commands: ``` bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=64 -s=1 -s_k=$((352 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=16 -h_k=8 -d=128 -s=1 -s_k=$((603 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=128 -s=1 -s_k=$((75 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=8 -h_k=1 -d=256 -s=1 -s_k=$((3133 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 ``` ## Test Result All the tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 06:16:10 +00:00
Aviral Goel	99ab4c4ef7	[rocm-libraries] ROCm/rocm-libraries#7830 (commit 590fe58) [CK_Tile][MI450] Add bf16 output wmma instruction (16x16x32) (#7830) Wire __builtin_amdgcn_wmma_bf16_16x16x32_bf16 into CK Tile for gfx1250, enabling bf16-input bf16-output WMMA at the warp GEMM level. - Add WmmaTraits specialization for <gfx125_t, bf16, bf16, bf16, 16,16,32> - Add WarpGemmAttributeWmmaImpl typedef and WarpGemmWmma alias - Add Dispatcher entry for bf16->bf16 16x16x32 - Add warp_gemm test with reference GEMM validation ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-02 13:54:16 +00:00
Tianyuan Wu	22a99f97e8	[rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93) [CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile ## Motivation This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and improves warp GEMM unit test coverage/structure for gfx1250-specific cases. ## Technical Details - Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added IsScale16 plumbing to warp GEMM dispatcher path - Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250 coverage to verify all F4 WMMA paths ## Test Plan Run ./test_ck_tile_wg_32x16x128_fp4. ## Test Result ``` ./test_ck_tile_wg_32x16x128_fp4 [----------] Global test environment tear-down [==========] 3 tests from 1 test suite ran. (1751 ms total) [ PASSED ] 3 tests. ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-30 01:28:48 +00:00
Emily Martins	95c916369c	[rocm-libraries] ROCm/rocm-libraries#7584 (commit 060bad5) [CK_TILE] Fix Stream-K k_size calculation ## Motivation In a recent benchmarking task for CK Tile Stream-K algorithm, we identified that certain instances segfault. This change works to fix the bug and adds necessary regression tests. ## Technical Details The StreamK kernel constructs tensor views using a `k_size` parameter that determines how much of the K dimension to process in each iteration. Previously, this was calculated as: ```cpp index_t k_size = num_loop_sk * TilePartitioner::KPerBlock; ``` This calculation assumes all macro tiles along K are exactly `KPerBlock` in size. However, when `K % KPerBlock != 0`, the final macro tile along K has a remainder size of `K % KPerBlock`, not a full `KPerBlock` (see the figure below): [Figure: image attachment not reproduced in this listing] With the old code, a workgroup working with the `MPerBlock x (K % KPerBlock)` tile in A and B risks accessing illegal memory. Hence, this change ensures that when `K % KPerBlock != 0`, workgroups processing iterations that include the final macro-tile along K calculate the correct `k_size` based on the remainder rather than assuming a full `KPerBlock`. ## Test Plan I added the following tests: 1. Unit tests added for the Stream-K Tile Partitioner: - `StreamKTilePartitionerBaseGetKSize/NoRemainderTiles` - validates full tiles - `StreamKTilePartitionerBaseGetKSize/RemainderTiles` - validates remainder handling 2. Regression tests that test a case where `K % KPerBlock != 0` ## Test Result Tests passed locally on gfx90a, gfx942, and gfx950. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-29 21:36:49 +00:00
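The remainder handling described in the commit above can be illustrated with a minimal sketch. The signature below (a starting iteration index along K plus a loop count) is an assumption made for illustration; the real logic lives in the Stream-K tile partitioner, whose interface is not shown in this log.

```cpp
// Hypothetical sketch of a remainder-aware k_size; not the CK partitioner API.
// Iterations [iter_start, iter_start + num_loop) are macro tiles along K.
constexpr long stream_k_size(long K, long k_per_block, long iter_start, long num_loop)
{
    const long begin = iter_start * k_per_block;
    const long full_end = (iter_start + num_loop) * k_per_block;

    // Clamp to K so the final macro tile contributes only K % k_per_block elements.
    const long end = full_end < K ? full_end : K;
    return end - begin;
}

// The previous formula from the commit message, kept for comparison.
constexpr long stream_k_size_old(long k_per_block, long num_loop)
{
    return num_loop * k_per_block;
}
```

With `K = 1000` and `KPerBlock = 64`, the last four iterations start at element 768. The old formula reports 256 elements, which would run 24 elements past the end of K, whereas the clamped version reports 232.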
Aviral Goel	15c904b460	[rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a) ck_tile: add FillUniformScaleDistribution and fix MX GEMM scale init (#7724) ## Summary ### Problem MX GEMM pipeline tests were passing vacuously: scale bytes were drawn from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷, far below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so numerical checks passed without exercising the GEMM. ### Changes `include/ck_tile/host/fill.hpp`: new `FillUniformScaleDistribution<ScaleType>` functor - Accepts human-readable float bounds and maps them to the raw byte range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE 754 exponent into the type's bias space - Sampling is uniform over raw bytes → uniform over representable values - Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to avoid shifting negative signed integers (C++17 UB) - Adds `assert(min_r <= max_r)` to catch inverted-range UB when both bounds exceed the type's representable range - Provides default member values (0.125f, 2.0f) and `std::optional` seed consistent with sibling fillers - `/** */` Doxygen style with `@note` on snapping asymmetry `test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`: fix scale initialization - Replace manual byte-range distribution with `FillUniformScaleDistribution<>{0.125f, 2.0f}` - Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid correlated scale tensors that were causing 60 test failures for fp4+e5m3/e4m3 combinations `test/ck_tile/utility/test_fill.cpp`: new unit tests for `FillUniformScaleDistribution` - 16 typed tests across e8m0, e4m3, e5m3: validity, range, reproducibility, coverage, snapping, stress, nullopt seed, and range overload - Test helper `expected_raw_range` mirrors implementation clamping exactly	2026-05-29 18:45:13 +00:00
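To see why raw bytes 40 to 60 made the MX GEMM tests vacuous, it helps to decode e8m0 by hand. The sketch below assumes the usual e8m0 convention (8 exponent bits, bias 127, no sign or mantissa, byte 255 reserved for NaN). The function names are hypothetical and this is not the `FillUniformScaleDistribution` implementation, only an illustration of the byte-to-scale mapping it has to respect.

```cpp
// Hypothetical sketch of e8m0 encode/decode; assumes value = 2^(byte - 127), byte <= 254.
constexpr double e8m0_to_double(unsigned byte)
{
    double v = 1.0;
    // Repeated doubling/halving keeps every intermediate an exact power of two.
    for (int e = static_cast<int>(byte) - 127; e > 0; --e)
        v *= 2.0;
    for (int e = static_cast<int>(byte) - 127; e < 0; ++e)
        v *= 0.5;
    return v;
}

// Maps a positive scale to the byte of the largest power of two not exceeding it,
// clamped to the representable exponent range.
constexpr unsigned scale_to_e8m0(double x)
{
    int e = 0;
    while (x >= 2.0 && e < 127)
    {
        x *= 0.5;
        ++e;
    }
    while (x < 1.0 && e > -127)
    {
        x *= 2.0;
        --e;
    }
    return static_cast<unsigned>(e + 127);
}
```

Under this convention the bounds `0.125f` and `2.0f` correspond to bytes 124 and 128, while byte 60 decodes to 2^-67, many orders of magnitude below the FP16 minimum denormal of 2^-24 (byte 103), so products scaled that way flush to zero.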
Andriy Roshchenko	d5c9215064	[rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f) [CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC ## Motivation Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async load instruction for tensor A, which complements the existing MX GEMM FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines are enabled on gfx1250. ## Technical Details The existing gfx950 implementation was extended to support gfx1250 architecture. All three MX FP data types are supported across the two ASICs. It should be noted that while the TDM pipeline uses an emulated 32x32x128 warp-tile instruction, the present submission relies on the built-in 16x16x128 instruction, called 4 times per warp. ## Test Plan Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250 functionality. To help facilitate the testing in development, `example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was introduced to verify various combinations of supported data types and pipeline versions. ## Test Result The present submission is expected to work on both gfx950 and gfx1250 hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. - [x] Relies on #6978 and should only be merged after the changes are merged to the `develop`.	2026-05-29 17:02:45 +00:00
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
assistant-librarian[bot]	6181eb2adf	[rocm-libraries] ROCm/rocm-libraries#4279 (commit 5b3f4b7) [CK_TILE] Stream-K XCD remapping (#4279) ## Proposed changes This PR adds support for XCD remapping as detailed in this [document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A). On gfx942, workgroups are typically scheduled round-robin across XCDs, which can lead to poor locality. We will use a remapping to assign workgroups to contiguous tiles in the XCDs improving the locality and the cache hit rate. This is done through a function that computes this contiguous mapping from this [PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have added to the StreamKTilePartitioner. This will require minimal changes to the Stream-K algorithm, only requiring a remap at the time the workgroups are partitioned. Through this approach we can improve the data locality by improving cache hits therefore closing performance gaps that are seen with the default scheduling. There have been unit tests added to verify the function in isolation. This is an optimization that is not specialized to just Stream-K GEMM and can be applied across GEMM. Note: This only applies to the gfx942 as they introduce the XCDs. Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. 
- [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652) 🧑‍💻 Originally authored by @arai713 --------- Co-authored-by: Astha <astha.rai713@gmail.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>	2026-05-26 09:43:03 -07:00
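The XCD remapping idea from the commit above can be sketched as an index permutation. This is only an illustration under stated assumptions: workgroup `i` is assumed to be scheduled on XCD `i % num_xcds` (round-robin), `remap_xcd` is a hypothetical name, and the function that the PR adds to `StreamKTilePartitioner` may differ in detail.

```cpp
// Hypothetical sketch: remap a round-robin workgroup id so that all workgroups
// landing on the same XCD receive a contiguous range of tile ids.
constexpr int remap_xcd(int wg_id, int num_wgs, int num_xcds)
{
    const int per_xcd = (num_wgs + num_xcds - 1) / num_xcds; // most workgroups any XCD holds
    const int rem     = num_wgs % num_xcds;
    const int tall    = rem == 0 ? num_xcds : rem;           // XCDs that hold per_xcd workgroups

    const int xcd   = wg_id % num_xcds; // XCD this workgroup is assumed to run on
    const int local = wg_id / num_xcds; // its position among that XCD's workgroups

    // XCDs before `tall` own per_xcd ids each; the rest own one fewer.
    return xcd < tall ? xcd * per_xcd + local
                      : tall * per_xcd + (xcd - tall) * (per_xcd - 1) + local;
}
```

With 10 workgroups on 4 XCDs, workgroups 0, 4 and 8 (all on XCD 0) map to tiles 0, 1 and 2, and workgroups 3 and 7 (XCD 3) map to tiles 8 and 9, so each XCD works on neighbouring tiles and the mapping remains a permutation.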
Yung-sheng Tu	760f9e1d0a	[rocm-libraries] ROCm/rocm-libraries#7104 (commit 0fab8d8) [CK TILE] Unification Work – Add MFMA specialisations for `fp64_t` (#7104) ## Motivation This PR adds two specialisations related to `fp64_t`. ## Technical Details This adds two new specialisations for MFMA dense builtins, and adjusts ABLayout and CLayout to L{K1BM} and L{M1BN}. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-26 10:49:36 +00:00
JP-Fernando	74bc86240b	[rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a) [CK Tile] Add gemm universal preshuffle to MX GEMM (#5647) ## Motivation Add gemm universal preshuffle support to existing MX GEMM pipeline. The straightforward way to do this is to port the `mx_flatmm` pipeline to the existing `gemm_mx` framework. ## Technical Details The `mx_flatmm` pipeline was not deleted, to allow for back-compatibility. ## Test Plan Add `preshuffle` option to example: `tile_example_mx_gemm`. Add new configurations with enabled preshuffle to the existing `test/ck_tile/gemm_mx` tests. ## Test Result Example and tests were successful on `gfx950` architecture in the `Alola` cluster. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-05-22 16:07:53 +02:00
Wojciech Laskowski	3ea9ce7e37	[rocm-libraries] ROCm/rocm-libraries#6567 (commit 753c7a8) [CK Tile] Adding WMMA wrappers for sparse builtins (#6567) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the third of the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801 and https://github.com/ROCm/rocm-libraries/pull/6014) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse WMMA intrinsics. ## Technical Details This change adds new specializations for WMMA sparse builtins. In total, we add 8 WMMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-22 13:34:33 +02:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
JiaLuo-CAN	5ff7497fa7	[rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4) [CK Tile] Fix Grouped Gemm quant mixed precision (#7537) <Migrate from Internal repo PR> test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8 cases: std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant, False, True, False>, std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant, False, True, False> GFX1250 would fail with incorrect results, GFX950 would fail when compiling BF8+FP8 and give incorrect results for FP8+BF8. The issue is due to the wrong ComputeDataType selection. The fix is to consider original ADataType and BDataType even when ComputeDataType is not void. For compiling error on gfx950, the bf8, fp8, 16x16x32 warp Gemm is added.	2026-05-21 08:36:23 -07:00
Wojciech Laskowski	275629fe34	[rocm-libraries] ROCm/rocm-libraries#6014 (commit 2f8259d) [CK Tile] Adding MFMA wrappers for dense builtins (#6014) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the second of the series of PRs (after #5801) that add all the necessary MMA builtins as `amdgcn_mma` structs. This PR focuses on dense MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA dense builtins. In total, we add 55 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Tests pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-21 09:05:19 +02:00
JH-Leon-KIM-AMD	720ceb6500	[rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f) [CK Tile] Support multi-vector reads in static encoding patterns (#7528) ## Motivation The thread-raked / warp-raked / block-raked static tile distribution patterns in `ck_tile` silently produce wrong results when the contiguous tile dimension is larger than `warp_size * vector_size`, because the encoding has no per-thread iteration dimension along X. Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in `ConvConfigComputeV3`, the grouped convolution backward-weight example reports about 50 percent wrong values, with errors starting exactly at the `X0X1 = 64` boundary. The second pass over the contiguous dim is never performed. This PR extends the encoding so multi-vector reads in the contiguous tile dimension are supported, while keeping every existing call site bit-for-bit identical. ## Technical Details Three files changed. ### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp` Add a per-thread X iteration dimension in all three raked specializations: - `X0 = min(warp_size, XPerTile / X1)` — threads in X dim - `X1 = min(LargestVec, VecSize)` — vector size per access - `X2 = XPerTile / (X0 X1)` — number of X-iters per thread (new) `X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both `make_2d_static_tile_distribution()` and `make_shuffled_2d_static_tile_distribution()`. The new encoding places `X2` in the middle of the Ys iteration list, which preserves reverse symmetry between the regular `<..., X2, X1>` and shuffled `<X1, X2, ...>` encodings. Patterns updated: `thread_raked`, `warp_raked`, `block_raked`. ### 2. `include/ck_tile/core/tensor/transpose_tile.hpp` Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch alongside the existing `NDimY == 2` branch. The original branch is byte-for-byte unchanged. 
Both branches dispatch to the same `transpose_tile2d_impl_in_thread`, whose body has always been NDimY-generic (iterates with `static_for<0, NDimY, 1>` and `number<NDimY>{}`). ### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py` Removed the two now-obsolete skip guards in `parse_bwd_weight_instances` and `parse_bwd_data_instances`: ```python if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector): print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.") continue ``` Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating, irregular-load shapes, scalar-per-vector > tile size) are kept untouched. ### Compatibility Strict. Every existing caller has `X2 == 1` and therefore hits the original encoding path verbatim. No upstream config or pipeline behavior changes. ## Test Plan The grouped convolution example is the natural exerciser since `GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for both A and B tiles, and all three conv directions share the same `ConvConfigComputeV3`. 
For each test below we ran: ``` ./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}] ./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}] ./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}] ``` with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both code paths:

| Test | M / N / K | VecA/B/C | A path | B path | dtype |
|------|-------------|----------|------------|------------|---------------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir) |
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3 dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128 KB > 64 KB limit), independent of this PR. For the generator change we ran: ``` python3 generate_instances.py --mode profiler --direction all ``` and verified `Skipping instance ... with multiple warps per continous tile dim` no longer appears (count went from non-zero to 0); other skip categories are unchanged. `clang-format-18` was applied to both modified `.hpp` files (matches the repo's `.clang-format`). ## Test Result - T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`. Confirms existing callers are unaffected. - T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3 encoding + transpose branch. - T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3 `correct`. - T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs `correct`. - Generator: 0 `multiple warps per continous tile dim` skips remaining; other skips unchanged. 
Sample run output (T4b, bf16, bwd_data): ``` shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32 pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0 Vector size A: 1, Vector size B: 1, Vector size C: 1 0.934907 ms, 8.34683 TFlops, 34.3178 GB/s Relative error threshold: 0.00390625 Absolute error threshold: 0.25 The CPU verification result is: correct ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-20 17:25:22 +03:00
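The `X0` / `X1` / `X2` formulas quoted in the commit above (720ceb6500) can be checked numerically. The following is a minimal Python sketch of that arithmetic only, not the `ck_tile` implementation. The function name `raked_x_dims` is hypothetical, and the defaults `warp_size=64` and `largest_vec=8` are assumptions chosen for illustration.

```python
def raked_x_dims(x_per_tile, vec_size, largest_vec=8, warp_size=64):
    """Return (X0, X1, X2) following the formulas quoted in the commit message.

    X1: vector size per access
    X0: number of threads along the contiguous X dimension
    X2: number of X iterations each thread performs
    largest_vec and warp_size are assumed values, not read from ck_tile.
    """
    x1 = min(largest_vec, vec_size)
    x0 = min(warp_size, x_per_tile // x1)
    if x_per_tile % (x0 * x1) != 0:
        raise ValueError("x_per_tile must be divisible by X0 * X1")
    x2 = x_per_tile // (x0 * x1)
    return x0, x1, x2


# Contiguous extents with vector size 1, mirroring the X2 values in the test table.
print(raked_x_dims(64, 1))   # X2 == 1: original encoding path
print(raked_x_dims(128, 1))  # X2 == 2: per-thread iteration dimension is needed
print(raked_x_dims(256, 1))  # X2 == 4
```

Under these assumptions, a vector size of 1 gives `X0 * X1 = 64`, which matches the boundary at which the commit message reports the wrong values starting.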
Kiefer van Teutem	b5f8bef97f	[rocm-libraries] ROCm/rocm-libraries#6088 (commit 6ac353c) [CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088) ## Motivation This MR makes all the changes required for the unified architecture to be able to deal with packed datatypes i.e. int4, fp4, fp6, and bf6. The crux is that layout parameters should be interpreted as describing the pure mathematical matrix fragments, while the ext_vectors and tile distribution encodings describe everything in terms of packed datatype units. This matches how packed types are dealt with in ck_tile and should play nicely with the load and store tile ops once we integrate the unified framework into CK tile. The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and pk_bf6x32_t, which did not exist before. The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size 1 and 2) were extended to make the subscripting operator work as expected. The layout test was adapted to be compatible with all packed datatypes, and all new intrinsics were added to the test. This MR adds ALL intrinsics across ALL architectures which use packed datatypes, as well as ALL scale intrinsics: mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6) mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6) wmma_i32_16x16x16_iu4_w32 wmma_i32_16x16x16_iu4_w32_gfx12 wmma_i32_16x16x32_iu4_w32_gfx12 ## Testing All intrinsics were tested on all architectures.	2026-05-20 12:36:13 +00:00
Enrico Degregori	9565ca21ec	[rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2) [CK Tile] Eight Waves pipeline for MX GEMM (#5552) ## Motivation Integrate Eight Waves pipeline in MX GEMM ## Technical Details - EightWaves pipeline: - Add pipeline, policy and block gemm (internally using existing implementation used by GEMM and ABQuant) - Extend support of EightWaves policy for FP4 (packed types) - Async pipeline: - Fix pipeline with packed scales (requires MRepeat and NRepeat to be contiguous) - block gemm specific for MX GEMM is defined because distribution encodings have changed - CShuffle: - Add new functionality to support MRepeat and NRepeat contiguous (defined by `TilesPacked`) - Examples: - Refactor examples to easily switch different configurations (similar to GEMM universal) - Scales values generated consistently with other microscale implementations in CK Tile - Add configuration for EightWaves pipeline - Tests: - Unify existing FP8 and FP4 tests - Add tests for EightWaves pipeline - Scales values generated consistently with other microscale implementations in CK Tile Note: FP6 support for MX GEMM was added later and the support for the Eight Waves pipeline will be done in following PR ## Test Plan Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and FP8 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-19 11:53:19 -07:00
Yung-sheng Tu	3ccb72e761	[rocm-libraries] ROCm/rocm-libraries#6207 (commit cc56378) [CK TILE] Unification Work – Add `print()` Utility to `MmaOpTraits` (#6207) ## Motivation It would be useful to have a `print()` utility within the unification work's code scope, so that we can print all template params and derived params of `amdgcn_mma` for easier debugging. ## Technical Details Adding helper functions and struct to traits, adding `print_flags()` for each `DefaultCtrlFlags`, `amdgcn_target` and `MmaOpTraits` structs, and adding `print()` for `amdgcn_mma`. Note: the first commit is not in the scope of this PR. This PR should be merged after https://github.com/ROCm/rocm-libraries/pull/5801 and https://github.com/ROCm/rocm-libraries/pull/5857. ## Test Plan Adding test in layout test. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-18 13:02:38 +02:00
Illia Silin	df07f060c1	[rocm-libraries] ROCm/rocm-libraries#7471 (commit 13b9eec) [CK] increase timeout limit for fmha_fwd tests to avoid CI failure on gfx11 (#7471) ## Motivation This should prevent fmha_fwd tests from timing out on one of the slower gfx11 CI nodes and generating false CI failures. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-15 16:41:23 +00:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Linjun-AMD	5003f7ef8a	[rocm-libraries] ROCm/rocm-libraries#7272 (commit d02f3c0) [ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner (#7272) ## Summary In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with first dimension `shape_batch` (= 1 in group mode), but the reference forward loop accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For any `wb >= 1` this is an out-of-bounds heap read, silently corrupting the reference forward math chain (`lse_host`, `o_host`) and turning the bwd-side `d_sink_head_acc` reference into non-deterministic garbage. `HostTensor::operator()` does not bounds check, so the OOB is not caught at runtime. This manifests as intermittent `tile_example_fmha_bwd` failures (25–67% fail rate) when `-sink_grad=1` is combined with `-mode=1` (group mode), with bit-exact but spurious `max_err` values like 4.27 / 14.6. ## Fix One-line: allocate `sink_host` with `batch` (the real per-batch dim) instead of `shape_batch`, mirroring how `sink_host` is accessed by the loop. ```diff - sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead} + sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead} ``` ## Repro ``` tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \ -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \ -v=3 -mode=1 -kname=1 -sink_grad=1 ``` ## Verification - 0/30 fail on the repro config after fix - Baselines (before fix): - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4) - sink=1, mask=t: 67% fail rate (p ≈ 6e-15) ## Attribution Shape bug introduced together with sink_grad in #5504. Unrelated to #6914 (which is a fwd-only fix on a different code path). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: junlin12 <junlin12@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-05-13 16:47:50 +08:00
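The shape mismatch described in the commit above (5003f7ef8a) can be illustrated by counting which `(wb, i_h)` accesses fall outside the allocation. This is a minimal Python sketch, not the runner's code. The helper name `out_of_bounds_accesses` is hypothetical, and a packed row-major layout for the two-dimensional host tensor is an assumption.

```python
def out_of_bounds_accesses(alloc_batch, batch, nhead):
    """List the (wb, i_h) pairs whose flat offset lies outside the allocation.

    alloc_batch: first dimension used when allocating the tensor
    batch:       number of batches the reference loop actually visits
    """
    size = alloc_batch * nhead  # number of elements actually allocated
    return [
        (wb, i_h)
        for wb in range(batch)
        for i_h in range(nhead)
        if wb * nhead + i_h >= size  # row-major flat offset (assumed layout)
    ]


# Group mode before the fix: allocated with shape_batch = 1, accessed with batch = 2.
print(out_of_bounds_accesses(alloc_batch=1, batch=2, nhead=2))
# After the fix: allocated with the real batch count.
print(out_of_bounds_accesses(alloc_batch=2, batch=2, nhead=2))
```

With the repro's `-b=2 -h=2`, every access with `wb = 1` is out of range before the fix and none is afterwards.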
Wojciech Laskowski	640bd560ec	[rocm-libraries] ROCm/rocm-libraries#5801 (commit 27f6d15) [CK Tile] Adding WMMA wrappers for dense builtins (#5801) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the first of the series of PRs that add all the necessary MMA builtins as `amdgcn_mma` structs. ## Technical Details This change adds new specializations for WMMA dense builtins. In total, we now have 9 RDNA4 builtins and 3 RDNA3 builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Tests pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com>	2026-04-27 11:57:51 +00:00
Sami Remes	de3fa71992	[rocm-libraries] ROCm/rocm-libraries#6611 (commit 5375c0f) [CK_TILE] Preserve input strides in EightWaves async-load descriptor (#6611) `MakeAsyncLoadADramWindow` in `GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes strides from lengths and assumes a dense layout. When the input view's leading-dim stride is larger than its inner length (non-packed memory layout), the resulting tile window stepped through memory at the wrong stride. Compose the unmerge transforms on top of the input view's existing descriptor instead, so the actual runtime strides are preserved and the correct `element_space_size` is inherited for bounds checking. ## Test Plan Added a unit test showing the problem. ## Test Result The new test fails before the fix and passes after. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 12:52:02 +02:00
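The stride problem described in the commit above (de3fa71992) comes down to the difference between strides synthesized from lengths and the strides the memory actually has. A minimal Python sketch of that difference follows. It does not use the CK Tile descriptor API; the helper names `packed_strides` and `offset`, and the 2D example shape, are hypothetical.

```python
def packed_strides(lengths):
    """Strides of a dense row-major layout, derived from the lengths alone."""
    strides = [1] * len(lengths)
    for i in range(len(lengths) - 2, -1, -1):
        strides[i] = strides[i + 1] * lengths[i + 1]
    return strides


def offset(index, strides):
    """Linear memory offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


# Hypothetical 4 x 8 view whose rows are 16 elements apart in memory (non-packed).
lengths = [4, 8]
actual_strides = [16, 1]

print(packed_strides(lengths))                   # strides a packed descriptor assumes
print(offset((1, 0), packed_strides(lengths)))   # where the packed descriptor lands
print(offset((1, 0), actual_strides))            # where row 1 really starts
```

When the leading stride exceeds the inner length, the two offsets disagree from the second row onward, which is the wrong-stride stepping the commit message describes.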
Yung-sheng Tu	5d36cad34a	[rocm-libraries] ROCm/rocm-libraries#5857 (commit d77cd41) [CK TILE] Unification of Scale MFMA/WMMA Policy Structs (#5857) ## Motivation The existing unification work supports DENSE and SPARSE intrinsics. In this PR, we enable support for SCALE intrinsics and add example SCALE implementations. ## Technical Details Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE intrinsics, and adding WMMA SCALE policy trait. Note: fp6 SCALE intrinsics support is not included in this PR, as its handling in ck_tile is currently more specialized and does not follow the same pattern as other datatypes. ## Test Plan Added new tests for the relevant SCALE specialisations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-20 14:28:23 +00:00
Max Podkorytov	3aee45e115	[rocm-libraries] ROCm/rocm-libraries#5383 (commit b660b8c) [CK_TILE] Add CShuffleLds microbenchmark suite (#5383) ## Summary Microbenchmarks isolating LDS store/load operations in CShuffleEpilogue for bank conflict analysis. ## Motivation CShuffleEpilogue performs LDS store (MFMA registers → LDS) and load (LDS → registers for coalesced global writes). This suite isolates each operation to: - Identify which operation causes bank conflicts - Measure pure LDS bandwidth per access pattern - Validate access patterns across MFMA tile sizes and wave layouts ## Components - Microkernels (`tile_load_store_microkernels.hpp`): `StoreTile<Setup>`, `LoadTile<Setup>` - Setup Adapters (`benchmark_cshuffle_lds.hpp`): Wire CShuffleEpilogue to microkernels - Template (`benchmark_template.cpp.in`): Generated benchmarks with timing ## Build ```bash cmake -G Ninja -B build -S . \ -DGPU_TARGETS=gfx950 \ -DBUILD_CK_EXAMPLES=ON \ -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON ninja -C build bench_lds_fp8_16x16x128_2x2_fp8 ``` ## New CMake Options \| Option \| Default \| Description \| \|--------\|---------\|-------------\| \| `BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS` \| OFF \| LDS microbenchmarks \| \| `BUILD_CK_TILE_FMHA_TESTS` \| ON \| FMHA tests \| \| `BUILD_CK_TILE_ENGINE` \| ON \| Tile engine \| \| `BUILD_CK_TILE_ENGINE_TESTS` \| ON \| Tile engine tests \| \| `BUILD_CK_EXAMPLES` \| ON \| Examples \| \| `BUILD_CK_TUTORIALS` \| ON \| Tutorials \| \| `BUILD_CK_DEVICE_INSTANCES` \| ON \| Device instances \| \| `BUILD_CK_PROFILER` \| ON \| Profiler \| Setting guards to OFF reduces cmake configure from ~150s to ~5s. --------- Made-with: Claude Code, Opus 4.5	2026-04-14 20:43:23 -07:00

334 Commits