composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 03:07:02 +00:00

Author	SHA1	Message	Date
Anton Gorenko	8428732dc2	Support multiple D in quant gemm kernel A new kernel is added QuantGemmMultiDKernel, the existing QuantGemmKernel behaves as usual.	2026-06-16 11:42:27 +05:00
Anton Gorenko	335f80033b	Impove precision of CShuffle with scales or multi D Two new template parameters are introduced: * CShuffleDataType allows to apply multiple Ds before downcasting to ODataType (prevents unexpected precision loss and/or overflow); * CComputeDataType to use scales with int32 AccDataType (int8 gemms);	2026-06-16 11:41:20 +05:00
Andriy Roshchenko	b8440b3aeb	[rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6) [GFX1250][MX GEMM] Unified FLATMM GroupedGemm Implementation for MX Data Types (#8325) ## Motivation Design and test a unified FLATMM GroupedGemm interface so that it supports all MX FP8, FP6, and FP4 data types on both the gfx950 and gfx1250 architectures and works seamlessly across these platforms. ## Technical Details Implementation exposes Grouped Gemm interface for MX FLATMM and MX TDM FLATMM pipelines. ## Test Plan Add the following tests: - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_non_tdm.cpp - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_tdm.cpp - ck_tile/flatmm/test_mx_flatmm_persistent.cpp Verify on the gfx950 and gfx1250 architectures. ## Test Result All tests pass. Verified on A0 hardware with rocm-7.14.0a20260517 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 16:12:33 +00:00
Sami Remes	c1f7104852	[rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01) [CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K, unsupported-shape guard (#6663) Three independent MX GEMM correctness bugs reported against example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one host-side atomic-add accumulation bug in the example's repeat loop. - Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop required num_loop > PrefetchStages, which let num_loop == 3 enter a hot loop that produced 5 gemm accumulations instead of 3 (K == 3K_Tile, e.g. K=768, deterministically wrong). Require num_loop >= 4 instead: pre-pipeline + TailNumber::Three already totals exactly 3. - Kernel (gemm_mx_kernel.hpp): split-K was silently broken because GridSize did not thread k_batch into blockIdx.z and the scale tile windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore read the wrong packed scales. Fix: GridSize returns dim3(grid_x, 1, k_batch) (persistent and non-persistent). * MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a packed-scale K offset (also apply pad_tensor_view so OOB scale loads return zero, matching A/B padding). * operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset (matches Underlying::SplitKBatchOffset's K1-aligned formula), and dispatches the epilogue with memory_operation_enum::atomic_add for k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard as UniversalGemmKernel. * MakeCBlockWindows templated on DstInMemOp; unconditionally applies pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are handled correctly. - Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp): add IsSupportedArgument and a static_assert for configurations that produce silent wrong results: * static_assert(!kPadK) -- the MX comp-async pipeline uses async_load_tile whose OOB check is per-vector-start, so a vector straddling the K pad boundary reads garbage. Until the async path learns per-element pad masking, reject kPadK at compile time. * Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always (CShuffleEpilogue cannot safely run with a single partial tile); K % (KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of WarpTile_K * k_batch so every split lands on a packed-scale boundary. * All error paths log under CK_TILE_LOGGING with actionable messages. - Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp): * Call Kernel::IsSupportedArgument up front and throw a clear runtime_error for rejected shapes (was silently launching an unsupported kernel). * Switch to launch_kernel_time_mask with a clear_gemm_output preprocess that zeroes C between iterations when k_batch > 1 (mirrors universal_gemm_invoker). Without this the default -warmup=50 -repeat=100 accumulated 150 atomic_adds into C after the kernel-side split-K fix. Tests (test/ck_tile/gemm_mx/): - Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true). - test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression), SplitK (k_batch=2,4 across full_k/partial_k paths), TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding} covering trailing partial tiles along M, N, or both. - Run(...) now takes k_batch. - packScalesMNxK: guard against OOB (mn, k) reads from src and initialise e8m0 bytes to the zero exponent (0x00) instead of the default-constructed NaN (0xFF), so padded lanes don't poison the packed int32_t shared with in-range lanes. - test_mx_gemm_instance.hpp: call IsSupportedArgument before launch. Verification on gfx950, ROCm 7.2.0: - ctest -R test_ck_tile_mx_gemm -> 100% (2/2). - Example sweep over the original bug-report shapes: all K-aligned shapes now validate correct (including 4096^3 sk=2 and the K=768 cases); all K=128 shapes cleanly rejected with the new error message instead of producing silent wrong results. Made-with: Cursor ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 08:28:55 +00:00
SamiAario-AMD	947dcc2606	[rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c) [CK Tile] Add transposed tile load implementation, and tests for load_and_convert_tile (#5510) ## Motivation Mixed precision b/fp16 x fp8 requires a transposed tile load implementation that supports mixed precision using these types. Implement this, use it in `load_and_convert_tile`, and add a unit test for `load_and_convert_tile` which covers this functionality. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 06:42:28 +00:00
Johannes Graner	01cca38c8e	[rocm-libraries] ROCm/rocm-libraries#8220 (commit 4c04a3a) [CK Tile] WAVELET pipeline for backward-data grouped convolution (#8220) ## Motivation On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was ~18% behind classic CK, with the gap concentrated in the K=2376 3x3 detection-head family where bwd_data spends most of its time. The WAVELET GEMM pipeline already gives uplift for forward and backward-weight conv; this ports it to backward-data and consolidates the now-shared machinery across all three directions. ## Technical Details - Backward-data wavelet support in the tile kernel: launch extra load waves when the pipeline exposes `LaunchBlockSize`, and split the epilogue into math waves (run the CShuffle epilogue) and load waves (`RunBarrierStub`). - Register 7 WAVELET instances (fp16 and bf16), tuned for backward-data's tall-skinny GEMM rather than the forward tile shapes: a big-M `256/128/64` workhorse, a `VecA=4` variant for the `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped (depthwise-style) shapes. - Implement the native backward-data instance parser in `generate_instances.py`. - Deduplicate the wavelet machinery shared by forward, backward-data, and backward-weight: `GroupedConvLaunchBlockSize`, `is_wavelet_pipeline`, and `RunWaveletAwareEpilogue` in `grouped_convolution_utils.hpp`; the three native instance parsers collapse to one parameterized parser. The three kernels now call the shared helpers. ## Test Plan - Rebuild the full profiler instance pools for all three directions (fp16/bf16/fp32, nhwgc/ndhwgc) to exercise the shared helpers across every instantiation. - Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`, `test_grouped_convnd_bwd_data_tile`, `test_grouped_convnd_bwd_weight_tile`. - Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK and the non-wavelet tile pool (`profile_wavelet_bwd_data.py`); correctness spot-checked with GPU-reference verification on the new big-M and NumGroupsToMerge instances. ## Test Result - GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6. - Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 -> 1.11, i.e. the tile path goes from ~12% slower than classic to ~8% faster. The largest single backward-data shape (256x100x100->2376) moves from 11% slower than classic to 12.5% faster. - The dedup refactor preserves behavior (net -174 lines across the kernels/generator), confirmed by the full rebuild and the GTests above. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-13 00:10:50 +00:00
John Afaganis	329e589840	[rocm-libraries] ROCm/rocm-libraries#8260 (commit 1139236) [ck] Enforce LF-only line endings in C/C++ sources MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Several CK source files carry Windows CRLF line endings (a trailing carriage return on each line), introduced by editors configured for Windows endings or copy/paste from Windows tooling. These are purely cosmetic but they pollute diffs (whole-file churn the first time someone makes an LF edit), confuse `clang-format`, and are inconsistent with the LF-only convention used across the rest of the tree. This PR (a) normalizes every existing CRLF file (6 files) to LF and (b) adds a pre-checkin gate so new CRLF leaks are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate as the adjacent `ASCII Only Check` stage: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `[ck] Normalize CRLF line endings to LF in C/C++ sources` Mechanical line-ending cleanup across 6 files. No content change: every edit is purely CRLF -> LF, verified with `git diff --ignore-cr-at-eol` reporting an empty diff. 2. `[ck] Enforce LF-only line endings in C/C++ sources` - New `projects/composablekernel/script/check_no_crlf.sh` (modeled on `check_ascii_only.sh`). - New `crlf-checker` entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). - New `CRLF Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the adjacent `ASCII Only Check` stage. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 0 CRLF violations; commit 2 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf 'int main() {}\r\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_no_crlf.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains CRLF (Windows) line endings: 1:int main() {}<CR> Fix: convert to LF, e.g. 'sed -i 's/\r$//' /tmp/bad.cpp' or 'dos2unix /tmp/bad.cpp' $ echo $? 1 ``` Full repo scan after the cleanup commit: ``` $ cd projects/composablekernel && find . -type f $ -name '.h' -o -name '.hpp' -o -name '.cpp' \ -o -name '.h.in' -o -name '.hpp.in' -o -name '.cpp.in' -o -name '.inc' -o -name '.cl' $ \ -not -path '/build/' -not -path '/include/rapidjson/' -print0 \ \| xargs -0 -P 8 -n 64 script/check_no_crlf.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> CRLF Check` stage runs green over the full predicate and the existing `ASCII Only Check` / `Clang Format` stages are unaffected. - [ ] Local: `pre-commit run crlf-checker --all-files` runs cleanly after installing CK pre-commit hooks. - [ ] Manually inject a CRLF line ending in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-12 21:11:59 +00:00
Wojciech Laskowski	c2601f38b7	[rocm-libraries] ROCm/rocm-libraries#6569 (commit 393049e) Adding amdgcn_mma specializations for sparse MFMA builtins (#6569) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the fourth of the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801, https://github.com/ROCm/rocm-libraries/pull/6014 and https://github.com/ROCm/rocm-libraries/pull/6567) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA sparse builtins. In total, we add 27 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 12:48:29 +00:00
Enrico Degregori	e75076c826	[rocm-libraries] ROCm/rocm-libraries#8310 (commit 003bc6b) [CK Tile] Fix assert usage MX GEMM ## Motivation See issue https://github.com/ROCm/rocm-libraries/issues/8223 ## Technical Details - Use `std::runtime_error` in `mx_processing.hpp` - Use `static_assert` in `tensor_shuffle_utils.hpp` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 11:42:38 +00:00
jefyang1	276863ca87	[rocm-libraries] ROCm/rocm-libraries#8259 (commit df03f10) Add cluster launch in test ck_tile mx gemm tdm wmma ## Motivation Add cluster launch test in test_ck_tile_mx_gemm_pipeline_tdm_wmma on gfx1250, so that we can check the performance on gfx1250 hardware. ## Technical Details Added Out-of-bounds guard in RunGemm of MxGemmKernel to skip blocks padded by cluster alignment. Add ClusterEnable/ClusterDisable aliases and extend the tuple in test_mx_gemm_pipeline_kernel_types.hpp by adding two kernel types with ClusterEnable for F8 CompTDMV1 and CompTDMV2 respectively. The existing F4 non-ClusterLaunch kernel types have issue to be fixed, so this PR does not include F4 cases. Read ClusterLaunch from the tuple in test_mx_gemm_pipeline_util.hpp. Update invoke_mx_gemm to branch on ClusterLaunch, including Add cluster size constants, Switch GemmShape type, TilePartitioner type, and the kernel launch call. ## Test Plan Tested the changes on gfx1250 FFM. ## Test Result The added kernel types (instances) passed the tests on gfx1250 FFM. ## Submission Checklist - [x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 17:33:11 +00:00
Bartłomiej Kocot	0fdbf8a91d	[rocm-libraries] ROCm/rocm-libraries#8272 (commit 1c66ecb) [CK] Padding on K for global load for grouped conv bwd data (#8272) ## Motivation Fix incorrect results caused by lack of padding during global load in grouped convolution backward data kernel. It is needed since there is no OOB check for global load. ## Technical Details Add padding needed for global load which not use OOB check. ## Test Plan test_grouped_convnd_bwd_data* ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 15:28:21 +00:00
Bartłomiej Kocot	cb099eb963	[rocm-libraries] ROCm/rocm-libraries#8155 (commit c25787b) [CK] Magic division for long_index_t ## Motivation Improve performance for long_index_t kernels ## Technical Details Support magic division for long_index_t ## Test Plan test_grouped_convnd* ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1386	2026-06-09 20:57:20 +00:00
Emily Martins	97ca00e449	[rocm-libraries] ROCm/rocm-libraries#7836 (commit cdd9958) [CK Tile] Stream-K RDNA Support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Currently, CK Tile Stream-K only supports CDNA architectures. This change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures. ## Technical Details Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear, and 3) tree. The linear and tree reductions require inter-workgroup communication to a global flags buffer and a global partials buffer. To ensure cache coherency, we use cache modifiers to skip cache levels that are not visible to all workgroups. On CDNA architectures, scalar load and scalar store instructions are available, which we use to read and write to the flags buffer with appropriate cache skipping modifiers. However, RDNA architectures do not support scalar store instructions, so workgroups must use a buffer store instruction to write to flags. Additionally, cache modifiers differ between CDNA and RDNA; they also differ between RDNA3 and RDNA4. Given this information, the main changes are as follows: - Added RDNA flag signaling: Use buffer store instructions for writing to global flags buffer - Add appropriate cache modifiers for reading and writing to flags and partials: - RDNA3 (gfx11): Use `glc \| dlc` coherence flags - RDNA4 (gfx12): Use `DEVICE` coherence scope - SFINAE-guarded overloads: Added compile-time dispatch for `SignalStorePartialDone()` and `WaitStorePartialDone()` based on target architecture - RDNA alignment requirements: Increased flags buffer alignment from 128B to 256B due to RDNA cache line size A note about the `amd_buffer_coherence_enum`: - Problem: The `amd_buffer_coherence_enum` uses preprocessor conditionals (`#if defined(__gfx12__)`) to define architecture-specific values. Template specializations reference enum values from different architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name lookup, non-dependent names are resolved during template parsing regardless of which architecture is being compiled, causing compilation failures when referenced values do not exist in the active preprocessor branch. - Temporary Solution: Added compatibility enum values to each architecture block. For example, I added `glc_dlc` in the `__gfx12__` block. I will create a ticket to refactor this enum with a design that has better scalability and tries to avoid the use of preprocessor conditionals. ## Test Plan ### Summary gtests were added to test wmma variants of Stream-K. These tests were stressed tested locally on gfx11 and gfx12. ### More details This PR makes the following changes/additions to the Stream-K gtests: - Split tests into MFMA (CDNA) and WMMA (RDNA) variants - Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction - WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size supported on RDNA) - Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU count - As described in [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898), when in Workgroup Processor (WGP) mode, the value of `hipDeviceAttributeMultiprocessorCount` is half of CUs, because a single WGP contains two CUs. The default mode on RDNA is WGP mode, so when creating (M, N, K) instances for gtests using the CU count, we need to multiply the CU count by 2 to get the correct value. This is not needed in the kernel host code, because the occupancy ensures that overall `max_active_wgs` is correct. ## Test Result All tests pass locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-08 22:48:10 +00:00
Márton Bidlek	0d18f4fc05	[rocm-libraries] ROCm/rocm-libraries#4798 (commit 0acaf5f) Using named functors instead of lambdas ## Motivation Currently, in block-level GEMM pipelines, there is significant code repetition for prefetching and tail handling, where lambda functions create a unique instantiations at each call. This includes repeated static_for instantiations and large loops such as MRepeat. Each repetition results in additional instantiations, which increases compilation time and binary bloat. ## Technical Details Refactor repeated code blocks into named functors so the compiler can reuse already instantiated code instead of generating multiple copies. Scope of changes: 1. WMMAOPS pipeline internals: projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_base.hpp, projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v1.hpp, projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v3.hpp 2. XDLOPS and preshuffle pipeline variants across projects/composablekernel/include/ck/tensor_operation/gpu/block (v1/v2/v3/v4/v5, scale, dequant, gufusion, moe, mx, blockscale, skip-b-lds, dpp, xdlops) Shared functor file: projects/composablekernel/include/ck/utility/vector_load_functor.hpp ## Test Plan Note that the provided compilation traces by -ftime-trace do not report unnamed lambda instantiations, so a clear baseline for instantiation counts cannot be established. As a result, the impact of this change will be evaluated based on runtime performance rather than direct instantiation-count comparisons. ## Test Result The effects of this were timed by the compilation of a single HIP object through an example (grouped_gemm_wmma_splitk_fp16.cpp). The average user time and speedup of this using the average of 100 compilations is: - Mean compile time before the changes: 37.734 s - Mean compile time after: 32.087 s - Speedup: 17.6% Ran a full CK compilation on Alola with the following results: \| Metric \| Before (min) \| After (min) \| Absolute Reduction (min) \| % Reduction \| \| ------ \| ------------ \| ----------- \| ------------------------ \|	2026-06-08 17:11:53 +00:00
Emily Martins	674f7cdc0e	[rocm-libraries] ROCm/rocm-libraries#8141 (commit d3defa6) [CK] Remove Stream-K from old CK ## Motivation Since Stream-K has a CK Tile implementation, we no longer need Stream-K in old CK. Hence, this PR removes Stream-K from old CK. ## Technical Details All Stream-K artifacts in old CK have been removed including examples, tests, kernels, and CK profiler artifacts. ## Test Plan Ran a CI run on the branch before publishing PR. ## Test Result All tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>	2026-06-08 16:47:26 +00:00
Johannes Graner	0b3c297ee2	[rocm-libraries] ROCm/rocm-libraries#8009 (commit 26ab70d) [CK Tile] Add WAVELET pipeline for forward grouped convolution (#8009) ## Motivation CK Tile forward grouped convolution trails classic CK on 3x3 convolutions whose output-channel count is not divisible by 8, where the narrow output store limits the compute CShuffle epilogue. This ports the WAVELET pipeline (added for backward-weight in #7937) to the forward kernel to close that gap. ## Technical Details - Kernel (`grouped_convolution_forward_kernel.hpp`): WAVELET load/math-wave wiring, mirroring the backward-weight implementation; the non-WAVELET path is unchanged. - Generator: implement `parse_native_fwd_instance`, the forward native-instance parser. - Registered WAVELET instances: profiler bf16 3 / fp16 5, tests 1 each. WAVELET requires input channels divisible by 8 (it does not apply to depthwise). The bf16/fp16 instance asymmetry is intentional and measured: the VecC=8 tiles never beat the compute pool in bf16 but win about 20% of divisible-by-8 3x3 shapes in fp16, so VecC=8 is registered for fp16 only. ## Test Plan - Correctness (CPU reference) for every registered profiler instance, across VecC variants. - Per-shape best-instance performance sweep over the 34 RetinaNet shapes (bf16) and a 200-shape cross-model sweep (bf16 and fp16), compared against classic CK. ## Test Result - Correctness: PASS for all instances. - RetinaNet (bf16, vs classic CK): faster on 28 of 34 shapes, geomean +19.5%; the not-divisible-by-8 shapes up to 3.7x. One 1x1 stride-2 shape stays ~20% behind classic CK, unrelated to WAVELET. - Cross-model (200 shapes): WAVELET wins 3x3 not-divisible-by-8 in both dtypes (up to 61% over the next-best compute instance); for divisible-by-8 3x3 it wins about 20% of shapes in fp16 (3-11%) and none in bf16. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 08:57:39 +00:00
Johannes Graner	b7d59e4b5f	[rocm-libraries] ROCm/rocm-libraries#8099 (commit fc4894b) [CK Tile] Fix Stream-K flag store: wave-uniform SGPR address for scalar s_store/s_load (#8099) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Stream-K grouped-conv (and GEMM) kernels fail to assemble for some instances: the inline scalar flag store/load gets a VGPR address operand, which scalar-memory instructions reject (`invalid operand for instruction`). This blocks Stream-K instances from building. ## Technical Details - `StreamKReductionOps::{Signal,Wait}StorePartialDone` (shared by GEMM and conv, added in #5393) take `kargs` by `const&` and feed `kargs.workspace_ptr` / `cta_idx` into inline `s_store_dword`/`s_load_dword` with `"s"` constraints. For some instantiations the compiler can't keep the pointer wave-uniform and emits a VGPR address. - Fix: route the pointer and offset through `amd_wave_read_first_lane` so the scalar-memory address is a wave-uniform SGPR before the asm. Same instructions, no algorithm change. - Not arch-specific: the affected instance fails on gfx908/gfx90a/gfx942/gfx950 without the fix; whether the compiler spills to a VGPR depends on the instantiation (tile/warp/pipeline), not the target. ## Test Plan - Compile the previously-failing dispatcher instance for gfx908/gfx90a/gfx942/gfx950. - `test_ck_tile_grouped_conv_bwd_weight_streamk` on gfx942, gfx90a, gfx950 hardware. - gfx950 perf A/B (example, bf16/tree, 10 runs each) with vs without the change. ## Test Result - Failing instance now assembles on all four archs; previously failed on every one. - 30/30 conv Stream-K tests pass on gfx942, gfx90a, gfx950. - gfx950 perf delta -0.13% (within run-to-run noise) — no regression from the added readfirstlane on the cold flag path. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 08:57:04 +00:00
Bartłomiej Kocot	28f2966762	[rocm-libraries] ROCm/rocm-libraries#7734 (commit 03ffb9d) [CK] Grouped Convolution Global Load/Store instances ## Motivation Support global load and store in grouped convolutions using instance factory. ## Technical Details - add new instances for each direction - add new tests for large cases ## Test Plan New test for large cases ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1255	2026-06-06 22:52:59 +00:00
Bartłomiej Kocot	2c363870d9	[rocm-libraries] ROCm/rocm-libraries#6744 (commit 9d056e8) [Ck][CK Tile] Global Load/Store for Large Tensors support (#6744) ## Motivation Create solution to support large tensors in the entire ck tile. ## Technical Details - add possiblity to use global load - int64 indexing ## Test Plan conv fwd tests ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-913	2026-06-06 10:14:17 +00:00
Enrico Degregori	1b4fbd95fd	[rocm-libraries] ROCm/rocm-libraries#6089 (commit c876d18) [CK Tile] Extend type support EightWave pipeline ## Motivation EightWave pipeline was designed for 8 bit types. This PR extend support for any FP type ## Technical Details - Generalize policy to support any FP type - Change LDS layout to fix bank conflicts. This removes all bank conflicts in the pipeline (checked for all supported types). Remaining bank conflicts are related to Cshuffle epilogue. ## Test Plan Added GEMM tests with new supported types. Note that FP6 is also supported for MX GEMM but the PR was reverted so no tests were added for it. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 23:54:40 +00:00
Yung-sheng Tu	e826b2eb7e	[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f) =?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?= =?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation This PR adds two specialisations related to `tf32_t`. ## Technical Details This change treats `tf32_t` as a concrete type rather than an empty `struct`. It also adds two new specialisations for MFMA dense builtins and resolves existing circular include issues. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 12:27:41 +00:00
Sami Remes	ad4e2e7624	[rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320) [CK_TILE] [QuantGEMM] Fix SplitK tail handling and other improvements (#7199) This pull request introduces improved and more robust split-K support for quantized GEMM. The main changes add runtime validation, utility functions for split-K batch calculations, pointer offset handling for split-K in grouped kernels, and enhanced support for various tensor layouts. The changes also improve error handling and provide more flexibility for runtime tail handling in split-K pipelines. Split-K Support and Validation Enhancements: * Added runtime validation to ensure `k_batch` is a positive integer and that split-K configurations do not produce empty final batches or mismatched pipeline tails, with detailed error messages and logging for misconfiguration. [[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211) [[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250) * Introduced utility functions `get_splitk_batch_k_read` and `get_splitk_last_batch_k` to compute per-batch K read sizes and handle split rounding, ensuring correct and consistent split-K batch partitioning. [[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R206-R234) [[2]](diffhunk://#diff-635b89bdffa96b2b42f1632520cde36701d7d631e864185591f6b32f7645cf47L104-R107) [[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L388-R417) [[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250) * Changed the default value of `k_batch` in `QuantGemmHostArgs` to 1 (no split-K) for safer default behavior. Pointer Offsets and Grouped Kernel Handling: * Updated `QuantGroupedGemmKernel` to apply split-K per-batch offsets to all input pointers, mirroring the behavior of non-grouped kernels and ensuring correctness for split-K launches. * Modified AQ tensor view handling to correctly reflect the remaining K-groups from the split-K batch's offset position, improving accuracy for split-K in grouped kernels. Pipeline and Layout Flexibility: * Added support for runtime selection of split-K tail handling via a new template parameter `RuntimeSplitKTail_`, with new helper methods to dispatch GEMM pipelines accordingly. [[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R273) [[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1496-R1567) [[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1427) [[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1447-R1629) [[5]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1459-R1641) * Improved handling for tensor layout cases, including preshuffled B and both row-major and column-major AQ layouts, ensuring correct pointer arithmetic and compatibility checks. [[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R438-R454) [[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L464-R516) [[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)	2026-06-05 11:41:49 +00:00
Enrico Degregori	7b9245f18c	[rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d) [CK Tile] Async support preshuffle GEMM ## Motivation Add async support to existing preshuffle GEMM pipeline ## Technical Details Notes: the implementation avoids previous strategy of duplicating pipelines for async support and instead add a switch `Async` to the ops Problem to enable async pipeline. Then, integrate the async pipeline in the existing one. This allows to avoid code duplication and facilitate the integration of buffer load to lds in existing pipelines. In my opinion, it should be used also for other pipelines which don't support buffer load to lds yet and it would also be a good idea to refactor the existing async GEMM pipelines with the same approach. Summary: - integrate buffer load to lds in existing pipeline - add optimal tensor descriptors for vmem loading and lds reading. They are currently optimized for 16x16 wave tiles but they also work for 32x32 wave tiles. Optimizations for 32x32 wave tile requires different lds layout and it will be done in a follow-up issue - Add async config to examples - Add test (gfx950 only) ## Test Plan New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 07:17:09 +00:00
Aviral Goel	267ca67001	[rocm-libraries] ROCm/rocm-libraries#8028 (commit c1cb112) [CK_Tile] Add wmma_bf16f32_16x16x32_bf16 via fused-downconvert override (#8028) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Adds `__builtin_amdgcn_wmma_bf16f32_16x16x32_bf16` (fp32 accumulate → bf16 output) to the CK Tile WMMA warp-gemm path. API only — the unit test is split into a stacked PR (#8035) so this API change can be reviewed in isolation. ## Changes (4 files) - 16-bit trait: `wmma_intrinsic_downconvert` (calls the bf16f32 builtin — fp32 C in, bf16 C out) plus `COutDataType = bf16_t` / `COutVecType`. - `WarpGemmAttributeWmmaImpl` / `WarpGemmAttributeWmma`: `mac_downconvert(c_fp32, a, b)` (kTransC-aware) returning the bf16 C-output vector. - `WarpGemmImpl`: `mac_downconvert` tail handler producing a bf16 C-output tile from the fp32 accumulator tile, reusing `CWarpDstrEncoding` (output layout identical to the f32 C tile). Verified on gfx1250 (via the stacked test PR #8035): the test passes; the existing WMMA warp-gemm test is unaffected (additive change only).	2026-06-05 05:01:31 +00:00
Enrico Degregori	bdd7a8333d	[rocm-libraries] ROCm/rocm-libraries#6672 (commit bda3f97) [CK Tile] PermuteN support MX GEMM ## Motivation Add PermuteN support to preshuffle MX GEMM ## Technical Details - Modify `shuffle_b_permuteN` to support MX preshuffled layout - Add `preShuffleScalePermuteN` with same functionality of `preShuffleScale` but layout consistent with PermuteN - Include MX pre-processing functions in the library ## Test Plan Add test configuration for permuteN with preshuffle (both FP4 and FP8) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Cong Ma <congma13@amd.com>	2026-06-05 03:04:43 +00:00
Illia Silin	aef7b42883	[rocm-libraries] ROCm/rocm-libraries#7816 (commit f6324af) [CK] Fix latest build issues with staging compiler. ## Motivation Fixing new warnings with staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-04 17:41:09 +00:00
John Afaganis	96c39b331e	[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da) [ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility (#7829) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary CK source files must be compilable via hipRTC (HIP runtime compilation), whose preprocessor does not accept non-ASCII bytes anywhere in a translation unit — including in comments. Bytes that are harmless under `hipcc` (em-dashes, smart quotes, multiplication signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at preprocessing time. These regularly leak in via LLM-assisted authoring or copy/paste from formatted documents and silently break hipRTC paths that are not exercised by the default `hipcc`-based build matrix. This PR (a) cleans every existing violation (53 files) and (b) adds a pre-checkin gate so new violations are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). This is a strict superset of the existing `Clang Format` stage's predicate — `.inc` is added so test-fixture include files are also gated. The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage* in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `79798aa6261` — `[ck] Convert reflect/ rendering to ASCII for hipRTC compatibility` Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `\|- / +- / \| ` (3-col width preserved so alignment is unchanged). `conv_description.hpp` swaps `×` for `x` as the dimension separator. `test_conv_description.cpp` expected strings updated in lockstep so the snapshot test stays green. This is the only commit in the series with observable runtime impact. 2. `738fdb0d81c` — `[ck] Strip non-ASCII bytes from C++ sources for hipRTC compatibility` Mechanical text cleanup across 53 files. Replacements happen in comments or in `std::cout` strings that are not asserted on by any test. None of the 174 `.inc` files in the tree required edits, but they were in the scan's predicate so the enforcement stage's predicate is a superset of what was scanned. Full replacement table in the commit message. 3. `1d7cd8ba235` — `[ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility` - New `projects/composablekernel/script/check_ascii_only.sh` (modeled on `check_copyright_year.sh`). - New entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). - New `ASCII Only Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the existing `Clang Format` stage but with `.inc` added to the find predicate. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 50 known violations; commit 2 leaves 0; commit 3 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains non-ASCII bytes: 1:// em-dash test — here Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.) $ echo $? 1 ``` Full repo scan after the cleanup commits (note the `-name '.inc'` clause): ``` $ cd projects/composablekernel && find . -type f $ -name '.h' -o -name '.hpp' -o -name '.cpp' \ -o -name '.h.in' -o -name '.hpp.in' -o -name '.cpp.in' -o -name '.inc' -o -name '.cl' $ \ -not -path '/build/' -not -path '/include/rapidjson/' -print0 \ \| xargs -0 -P 8 -n 64 script/check_ascii_only.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check` stage runs green over the full predicate (incl. `*.inc`) and existing `Clang Format` stage is unaffected. - [ ] `test_conv_description` passes against the ASCII tree-formatter output (touched in commit 1). - [ ] Local: `pre-commit run ascii-only-checker --all-files` runs cleanly after installing CK pre-commit hooks via `script/install_precommit.sh`. - [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. - [ ] Spot-check a representative subset of touched files under hipRTC compilation to confirm no remaining hipRTC-blocking content (optional, since the static byte check is a sufficient condition for hipRTC preprocessor acceptance on this dimension). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-04 15:00:17 +00:00
Copilot	4fcd73a98e	[rocm-libraries] ROCm/rocm-libraries#7974 (commit 9df2c76) composablekernel: remove stray .hpp.bk backup artifacts (#7974) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four `.hpp.bk` files were accidentally committed to `projects/composablekernel/`, likely as leftovers from a prior merge or conflict resolution. Each is an older snapshot of its `.hpp` counterpart — the canonical `.hpp` files are newer and contain the correct current content. ## Deleted files \| File \| vs. `.hpp` counterpart \| \|---\|---\| \| `ck_tile/core/tensor/tile_window.hpp.bk` \| Older version: uses legacy `bool isL1Cache`/`PrefetchL1` template params; missing `DataCachePrefetchKind`-based prefetch API and `data_cache_prefetch.hpp` include \| \| `ck_tile/core/tensor/load_tile_transpose.hpp.bk` \| Older version: missing `#if defined(__gfx950__)` guard and `Quad` struct (~90 lines) for gfx1250 architecture \| \| `ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp.bk` \| Older version: missing `WmmaTag`, `IsScale16` template param, and several newer dispatcher specializations \| \| `ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp.bk` \| Older version: `KPackA`/`KPackB` (since renamed `KPack`); uses `static_ford` (since refactored to nested `static_for`) \| ## Verification - No other `.bk` files exist in `projects/composablekernel/`. - No build scripts, CMake files, includes, or documentation reference these `.bk` files. - No `.hpp` files were modified.	2026-06-04 03:06:43 +00:00
apophis	42c82b093e	[rocm-libraries] ROCm/rocm-libraries#7786 (commit 7842dfd) [CK TILE][Windows] add `msvc::no_unique_address` support for Windows (#7786) ## Motivation While building Flash Attention 2 with CK backend, this warning will spam in every kernel: ``` DEBUG [1/1837] hipcc.exe ... DEBUG In file included from H:\ROCm\flash-attention\build\fmha_fwd_d32_bf16_batch_b64x64x16x32x32x32_r4x1x1_r4x1x1_w16x16x16_w16x16x16_qr_vr_pssk_nlogits_alibi_mask_lse_ndropout_nskip_nqscale_ntrload_nsink_gfx12.cu:6: DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\example\ck_tile\01_fmha\fmha_fwd.hpp:6: DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core.hpp:111: DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1246:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes] DEBUG 1246 \| [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, PageIdxArray, gl_field_empty_t> DEBUG \| ^~~~~~~~~~~~~~~~~ DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1254:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes] DEBUG 1254 \| [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, index_t, gl_field_empty_t> DEBUG \| ^~~~~~~~~~~~~~~~~ DEBUG 2 warnings generated when compiling for host. ... ``` ## Technical Details `[[no_unique_address]]` is not working on Windows LLVM, should use `[[msvc::no_unique_address]]`. ## Test Plan Build FA2 with CK backend. ## Test Result No warnings, no errors. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-06-04 02:28:12 +00:00
Aviral Goel	e01603bc31	[rocm-libraries] ROCm/rocm-libraries#7725 (commit eef7e12) [GFX1250][CK_TILE] Add scale16 warp gemm unit tests ## Summary - Add scale16 WMMA intrinsic overloads and int64_t forwarding to warp gemm layers for gfx1250 - Add comprehensive wave-level unit tests for scale16 warp gemm (16x16x128 and 32x32x128 tile sizes) - Test all fp8/bf8 type combinations and TransposeC variants - Fix WarpGemm wrapper for non-uniform scale16 configurations Stacked on #7724 (FillUniformScaleDistribution / MX GEMM scale init). Pipeline enablement follows in the next PR.	2026-06-03 22:05:29 +00:00
chris-tsiaousis-hpc	db05d61136	[rocm-libraries] ROCm/rocm-libraries#6212 (commit ccee58d) =?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?= =?UTF-8?q?More=20accurate=20tests=20for=20MmaPipelines=20(#6212)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation This PR solves several issues: #### More accurate tests for MmaPipelines The current tests for the MmaPipelines (test_amdgcn_sparse_mma, test_amdgcn_wavewise_mma) use explicit input fragment vectors filled with 1s, and only check the output of a single lane. We should have tests that actually use the MmaPipelines with non-trivial input matrices and verify the complete output. Some other aspects of the current MmaPipelines tests that I noticed and deserve some attention: 1. There is sometimes iteration over K outside of the pipeline, which is then included in WaveTileK or FragK, which is not correct. We should remove it, move K iteration inside of the pipeline, or be more clear about this outer-K loop size and how it propagates downwards. 2. There is very tight coupling between the kernel, gtest code, and test_pipeline helper, requiring a lot of information and functions to be passed back and forth. 3. The test_pipeline helper is doing a bunch of register-related logic on the host (related to point 1) 4. Without this register logic the only thing it does is check the device, call the kernel, and check the output, but with a lot of boilerplate. #### Test helper for detecting target arch at HOST runtime There is a really apparent issue we faced while writing tests: Scenario: 1. Compile a test that supports both gfx950 and gfx1201 for gfx950 2. Run the test on a server that only has gfx1201 GPU Actual: Segmentation fault Expected: The test can correctly detect from HOST runtime that the DEVICE target_id was different and skips the test. Notes: The only way of detecting the COMPILER_TARGET_ID in the existing "arch" framework is launching a kernel and calling `get_compiler_target()` (so, from a DEVICE code). This will create a segmentation fault if the current arch differs from the target arch. To cope with this issue, we propose to export the compiler target(s) (note they can be many) through `projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and define a test helper to deal with such cases. #### Add composition support to Transforms We have a small number of Transforms which act on MmaOp input and output data, before and after the MmaOp call respectively. These are currently implemented to work on an MmaTile level, but in theory they are also supposed to work at a WaveTile level, i.e. after composition of multiple MmaTiles to create larger effective MNK dimensions. Currently the composed MmaTiles look like 2D C-style arrays of the individual MmaTile level register vectors (see WaveWiseMmaPipeline). The transforms should be able to take these and perform the proper transforms to the whole WaveTile at once. This might allow for better performing transformations. Note: This PR handles the SparseTransform case and if we don't end up doing scale as a transformation, there isn't really much left to do. If we end up having only the sparse transform as a non-trivial transform, then we could also consider removing the Transform framework.	2026-06-03 14:35:18 +00:00
Ville Pietilä	88f8d24c34	[rocm-libraries] ROCm/rocm-libraries#7936 (commit 3dc91e6) [CK Tile] Fix V6 pipeline applicability and split-image initialization (#7936) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation After adding code generation via CK Tile Dispatcher, some fwd and bwd weight tests for CK Tile convolutions are failing. This PR introduced correct applicability checks and fixes the split-image parameter initialization such that non-applicable instances are not invoked during test execution and split-image instances are correctly initialized. ## Technical Details Investigation revealed two distinct problems 1. For bwd weight, the compute V3 uses prefetch of 3 distinct tiles, which works incorrectly when the number of K-slices addressed by the workgroup is 1. This occurs when a large split-K value is used for a problem that results in a small Gemm-K value. 2. For fwd direction, the current CK Profiler/test infrastructure doesn't initialize the split-image parameters for instance where split-image is enable. Uninitialized split-image values result in non-deterministic behavior where the tests might randomly fail. Fixed problem 1. by adding a check in `IsSupportedArgument` that marks the instance invalid if the `num_loops = ceil(GemmK / (k_batch * KPerBlock)) < 4` for V6 pipeline kernel instances. The check is compile-time eliminated for other kernels. Fixed problem 2. by adding initialization of split-image parameters when split-image is enabled. The default initialization corresponds to full image with no split, i.e., the number of splits is 1 and it has the size of the full image. Added unit tests for the added logic. ## Test Plan Running the following test suites cover the logic added in this PR - test_grouped_convnd_fwd_tile - test_ck_tile_grouped_conv_fwd - test_grouped_convnd_bwd_weight_tile - test_ck_tile_grouped_conv_bwd_weight All test suites above are included in the automated test runs. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 08:40:03 +00:00
Anton Gorenko	7ecbf82708	[rocm-libraries] ROCm/rocm-libraries#7500 (commit f5cd4fd) [CK_TILE][FMHA] Optimize long-context decoding on gfx11/12 (#7500) ## Motivation Relevant issue: ROCM-22065 FMHA has less-than-optimal performance of long-context decoding (i.e. when seqlen_q = 1) on gfx11/12. This PR optimizes the splitkv pipeline and configs for such scenarios. ## Technical Details Optimizations applied in this PR: 1. use tiles with smaller M0 (16 vs 64), these tiles are used when seqlen_q <= 16 2. adapt qr_nwarp_sshuffle pipeline for gfx11, it allows to use more warps even for M0 = 16 (the qr pipeline parallelizes work between warps in M dim so with M0 = 16 it allows to use only 1 warp) 3. enable kMergeNumHeadGroupsSeqLenQ (an optimization that merges one group of heads in GQA) for all hdim values, not only 128 4. increase the number of splits (multiply by the number of head groups) if (3) is used 5. increase the number of splits for RDNAs (`multiProcessorCount` is the number of WGPs on RDNAs, not CUs, so it should be doubled to have meaning similar to CDNAs) Performance on gfx1151: \| Case \| develop (GB/s) \| This PR (GB/s) \| \|:-------\|-------:\|-------:\| \| [fp16\\|group\\|bshd] b:1, h:32/32, s:1/45056, d:64/64 \| 127.58 \| 183.11 \| \| [fp16\\|group\\|bhsd] b:1, h:32/32, s:1/45056, d:64/64 \| 153.64 \| 215.02 \| \| [fp16\\|group\\|bshd] b:1, h:16/8, s:1/77184, d:128/128 \| 120.51 \| 225.76 \| \| [fp16\\|group\\|bhsd] b:1, h:16/8, s:1/77184, d:128/128 \| 130.62 \| 223.84 \| \| [fp16\\|group\\|bshd] b:1, h:32/32, s:1/9600, d:128/128 \| 82.65 \| 138.44 \| \| [fp16\\|group\\|bhsd] b:1, h:32/32, s:1/9600, d:128/128 \| 105.75 \| 220.45 \| \| [fp16\\|group\\|bshd] b:1, h:8/1, s:1/401024, d:256/256 \| 16.27 \| 187.89 \| \| [fp16\\|group\\|bhsd] b:1, h:8/1, s:1/401024, d:256/256 \| 16.28 \| 188.19 \| ## Test Plan An additional test case is added to the exiting test. It uses seqlen_q = 1, GQA, no mask to trigger the changes ``` ninja test_ck_tile_fmha_fwd_fp16 && bin/test_ck_tile_fmha_fwd_fp16 --gtest_filter="SplitKV ninja test_ck_tile_fmha_fwd_bf16 && bin/test_ck_tile_fmha_fwd_bf16 --gtest_filter="SplitKV ``` Manual testing can be done with these commands: ``` bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=64 -s=1 -s_k=$((352 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=16 -h_k=8 -d=128 -s=1 -s_k=$((603 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=128 -s=1 -s_k=$((75 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=8 -h_k=1 -d=256 -s=1 -s_k=$((3133 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 ``` ## Test Result All the tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 06:16:10 +00:00
Yi DING	01bd52bdb5	[rocm-libraries] ROCm/rocm-libraries#7925 (commit a8f0845) [CK] Fix gfx950 AITER Sync Regressions MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Fixes three gfx950 regressions in the AITER downstream CI that surfaced after the internal/gfx1250 re-sync (ROCm/rocm-libraries#6978): > Companion aiter PR: ROCm/aiter#3392 — host-side adaptations (`Kernel::BlockSize()` `constexpr` drops, blockscale `KBatch=1` clamp) plus the CK submodule bump used to validate these fixes together. - FlyDSL MoE AOT cache miss — the AITER MoE tests run with `check_aot_cache=True` and fail on any FlyDSL JIT cache miss, but the CI never pre-compiles the FlyDSL MoE kernels, so gfx950 always misses. Pre-compile them at the start of the AITER test stage. - `buffer.load.lds.v4i32` link error — ROCm/rocm-libraries#6978 reintroduced a clang-version guard mapping `llvm.amdgcn.raw.buffer.load.lds` to a `.v4i32`-suffixed name. That name exists in no LLVM (the rsrc operand is a fixed, non-overloaded `<4 x i32>`, so the intrinsic is never type-mangled), so gfx950 4-DWORD direct-to-LDS (e.g. fp4 MoE bpreshuffle) fails to link with `lld: undefined symbol: llvm.amdgcn.raw.buffer.load.lds.v4i32`. Use the canonical plain name unconditionally. - mixed-precision flatmm warp-GEMM call — ROCm/rocm-libraries#6978 generalized the scaled `WarpGemmImpl::operator()` from a fixed `<index_t opselA, index_t opselB>` signature to a variadic `<typename... Params>` one and updated the `mx_flatmm` pipeline to pass the op-selectors as `OpSelA<>`/`OpSelB<>` types, but missed the mixed-precision flatmm pipeline (`F8xMXF4`/`F16xMXF4`), which still passed raw integer op-selectors. These no longer bind to `typename... Params` (`error: no matching member function for call to 'operator()'`), breaking compilation of the fp8/bf16 × fp4 cktile MoE gemm1 instances on gfx950 (aiter `test_moe_2stage`). Wrap the op-selectors in `OpSelA<>`/`OpSelB<>`. ## Changes - `Jenkinsfile`: pre-compile the FlyDSL MoE AOT cache (`python3 aiter/aot/flydsl/moe.py`) before the AITER tests. - `include/ck/utility/amd_buffer_addressing_builtins.hpp` and `include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp`: drop the `__clang_major__` guard and always use `__asm("llvm.amdgcn.raw.buffer.load.lds")`. The plain name is the canonical one for all sizes including the gfx950 16-byte form, as the upstream LLVM gfx950 tests confirm. - `include/ck_tile/ops/flatmm/pipeline/mixed_prec_flatmm_pipeline_agmem_bgmem_creg_v1.hpp`: wrap the warp-GEMM op-selectors in `OpSelA<>`/`OpSelB<>` at the five call sites, matching the `mx_flatmm` pipeline. ## Test plan Validated via CI.	2026-06-03 02:09:05 +00:00
Illia Silin	5720589311	[rocm-libraries] ROCm/rocm-libraries#7960 (commit ddac5cf) [CK] Upgrade to new gfx1250 compiler and fix build issues (#7960) ## Motivation The docker image we've been using to build for gfx1250 is a few months old, so we need to upgrade. Some of the changes in the latest compiler version require changes in the code. TDM is temporarily disabled due to changes in the lds load/store intrinsics. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 01:58:59 +00:00
Maksim (Max) Podkorytov	d574cc4757	[rocm-libraries] ROCm/rocm-libraries#6696 (commit 9627b91) Replace nested static_for lambdas with compile-time search helper (#6696) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary - Add `sequence_find_value` and `find_in_tuple_of_sequences` compile-time search helpers with O(1) template depth - Replace nested `static_for` lambdas in `TensorDescriptor::GetTransformAndItsUpperDimension` and `InitializeElementSize` - Apply same optimizations to `TensorAdaptor` Supersedes #4287. Conflict-resolved rebase of ROCm/composable_kernel#3600 onto current develop. ## Motivation The `TensorDescriptor` and `TensorAdaptor` classes had excessive template instantiation from: 1. Nested `static_for` loops with lambdas creating unique closure types at every call site 2. `generate_tuple` with lambdas causing per-type instantiation overhead The new helpers use constexpr array lookup and pack expansion instead of recursive template patterns, achieving O(1) template depth. ## Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved, `-j1`, `-ftime-trace`) \| TU \| Baseline (mean) \| New (mean) \| Delta \| Wilcoxon p \| Mann-Whitney p \| \|----\|-----------------\|------------\|-------\|-----------\|---------------\| \| `grouped_conv_fwd_xdl_fp16` (host) \| 14,886 ms \| 13,353 ms \| -10.3% \| 0.002 \| 0.0002 \| \| `grouped_conv_fwd_xdl_fp16` (device) \| 27,762 ms \| 25,629 ms \| -7.7% \| 0.002 \| 0.0002 \| \| Total (all TUs) \| 57,732 ms \| 54,030 ms \| -6.4% \| \| \| Unrelated TUs (`device_memory`, `host_tensor`, `convolution_parameter`) show no significant difference (p > 0.3), serving as negative controls. ### Methodology - 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) on the same node to eliminate ordering/warmup bias - Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U test (unpaired) - Built with patched clang (LLVM 22) on ctr2-alola-compile-11, `-j1` for accurate per-TU timing - Raw data available in Slurm job 275230 results ## Test plan - [x] 11 unit tests added (5 for `sequence_find_value`, 6 for `find_in_tuple_of_sequences`) - [x] Compile-time benchmark with statistical significance (p < 0.01) - [ ] Full CI Tracking issue: #4229	2026-06-02 23:15:10 +00:00
Aviral Goel	99ab4c4ef7	[rocm-libraries] ROCm/rocm-libraries#7830 (commit 590fe58) [CK_Tile][MI450] Add bf16 output wmma instruction (16x16x32) (#7830) Wire __builtin_amdgcn_wmma_bf16_16x16x32_bf16 into CK Tile for gfx1250, enabling bf16-input bf16-output WMMA at the warp GEMM level. - Add WmmaTraits specialization for <gfx125_t, bf16, bf16, bf16, 16,16,32> - Add WarpGemmAttributeWmmaImpl typedef and WarpGemmWmma alias - Add Dispatcher entry for bf16->bf16 16x16x32 - Add warp_gemm test with reference GEMM validation ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-02 13:54:16 +00:00
Sami Remes	919096fde8	[rocm-libraries] ROCm/rocm-libraries#7935 (commit 5c96097) [CK] Allow skipping split-K C-buffer zero-init in xdl_cshuffle blockscale GEMM (#7935) Add a `skip_zero_init` flag (default false) to the Problem/Argument of the xdl_cshuffle block-scale GEMM device ops (multiple_d ab_scale and blockscale b-preshuffle). When the flag is set, the device invoker skips the internal hipMemsetAsync that zeroes p_c_grid before the KBatch > 1 split-K atomic-accumulation path. The flag is declared on the gridwise Problem struct (inherited by Argument), so it is visible on both the rotating-cache (arg_) and the normal (arg) launch paths in each device op. Why: callers that already pre-zero the output buffer otherwise pay for a redundant device-wide memset before split-K atomic accumulation. Gating the memset behind an opt-in flag lets such callers avoid the duplicate work. Because the flag defaults to false, every existing call site is unaffected and the observable behavior is unchanged. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-02 13:08:46 +00:00
Johannes Graner	b7c8fb164f	[rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d) [CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight instances (#7937) ## Motivation CK Tile had no pipeline competitive with old CK's wavelet on the RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive kernel for spatial small-K shapes. ## Technical Details - New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`): workgroup split into math waves (LDS read + MFMA) and load waves (DRAM read + LDS write). - VGPR role-split: `operator()` has two top-level mutually-exclusive `is_math` branches so the allocator overlays both roles onto the same physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy. Correctness depends on identical `block_sync_lds` counts on both arms plus a matching load-wave barrier stub in the epilogue (`cshuffle_epilogue.hpp`). - Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`): `kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub. Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36 C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27% (googlenet K=320 by 16-23%); the role-split roughly halves the parity gap vs old CK on the 13x13 fp16 shape. ## Test Plan - `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16, `split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a broad 2D sweep. - Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64} (barrier-parity / deadlock check). - `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet instances. ## Test Result - All wavelet instances CPU-verify correct across the split-K sweep; no hangs (dual-arm barrier sequence matches). - Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better vs old CK on the majority of spatial shapes. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-02 08:51:17 +00:00
Chao	c56c6750d0	[rocm-libraries] ROCm/rocm-libraries#6498 (commit 5961a2e) [CK_TILE] Fix conditional rescale numerical instability in FMHA forward (#6498) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [CK_TILE] Fix conditional rescale numerical instability in FMHA forward ## Motivation Fix numerical instability in the conditional O-accumulator rescaling optimization for CK-Tile FMHA forward (FlashAttention-4, Algorithm 6, Eq. 6). The conditional rescale optimization skips the expensive O-accumulator rescale when the running row-max shift is within a threshold (tau = log2(256) = 8.0). The original implementation had a bug: attention weights P were computed in the `m_new` reference frame before the skip/rescale decision. In the skip branch, `m` was reverted to `m_old`, but P remained in the `m_new` frame, causing incorrect softmax normalization. This fix introduces a `p_row_correction` factor: in the skip branch, P is multiplied by `exp2(m_new - m_old)` to bring it back to the `m_old` reference frame. - Correctness: Fixes broken inference on long sequences where running-max drift causes exp2 overflow (observed as degraded image quality on MI350X Flux2 generation) - Performance: Neutral to +4% depending on workload shape ## Technical Details 6 pipeline header files (same pattern in each): - `block_fmha_pipeline_qr_ks_vs.hpp` - `block_fmha_pipeline_qr_ks_vs_async.hpp` - `block_fmha_pipeline_qr_ks_vs_async_trload.hpp` - `block_fmha_pipeline_qr_ks_vs_fp8.hpp` - `block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp` - `block_fmha_pipeline_qs_ks_vs.hpp` In each file: - Lower threshold from 10.0 to 8.0 (tau = log2(256)) - Add `p_row_correction` distributed tensor initialized to 1.0 - Rescale branch: standard rescale of O_acc and l; correction = 1.0 - Skip branch: compute correction = exp2(-acc_scale_log2), update l, revert m, store correction - New `p_spans` sweep applies per-row correction to `p_compute` before P*V GEMM - Move P-to-PDataType cast to after correction sweep ## Dependencies None — this PR is standalone. ## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=fp16 -v=1 -warmup=1 -repeat=3` ## Test Result Accuracy vs FP32 reference (MI350X, gfx950): \| Shape \| max_diff \| mean_diff \| \|-------\|----------\|-----------\| \| B=1 H=24 M=4096 K=128 bf16 \| 9.1e-4 \| 4.6e-5 \| \| B=4 H=32 M=4096 K=128 bf16 \| 9.9e-4 \| 4.6e-5 \| \| B=1 H=24 M=4096 K=128 fp16 \| 1.2e-4 \| 9.0e-6 \| Performance (MI350X, gfx950, ROCm 7.0): \| Shape \| FA4 (TFlops) \| Always-rescale (TFlops) \| Delta \| \|-------\|-------------\|------------------------\|-------\| \| B=1 H=24 M=4096 K=128 bf16 \| 425.9 \| 428.5 \| neutral \| \| B=2 H=8 M=2048 K=256 bf16 \| 513.9 \| 509.0 \| +1.0% \| \| B=1 H=64 M=2048 K=64 bf16 \| 481.7 \| 464.3 \| +3.7% \| Benchmark results (MI300X, gfx942, ROCm 6.4.1): No regression on MI300X. This correctness fix is performance-neutral. \| Config \| TFlops / GB/s \| Time (ms) \| \|--------\|-------------\|-----------\| \| MHA bf16 b=2 h=8 s=4096 d=128 \| 342.49 TFlops \| 0.401 \| \| MHA fp16 b=2 h=8 s=4096 d=128 \| 391.70 TFlops \| 0.351 \| \| Causal MHA bf16 b=2 h=8 s=4096 d=128 \| 227.07 TFlops \| 0.303 \| \| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 \| 324.69 TFlops \| 0.423 \| \| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 \| 348.09 TFlops \| 0.790 \| \| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 \| 376.71 TFlops \| 1.459 \| \| Long-seq bf16 b=1 h=16 s=16384 d=128 \| 383.42 TFlops \| 5.735 \| \| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 \| 691.64 GB/s \| 1.554 \| All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - Uniform scores: softmax output matches FP32 reference (max_diff < 1e-3) - Large seqlen (4096+): no overflow or NaN in O-accumulator - Spike pattern: correct handling of sudden row-max jumps - Multiple spikes: correction applied correctly across multiple skip/rescale transitions - Deterministic: identical outputs across repeated runs - No performance regression on standard workloads	2026-05-30 10:34:06 +00:00
Tianyuan Wu	22a99f97e8	[rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93) [CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile ## Motivation This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and improves warp GEMM unit test coverage/structure for gfx1250-specific cases. ## Technical Details - Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added IsScale16 plumbing to warp GEMM dispatcher path - Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250 coverage to verify all F4 WMMA paths ## Test Plan Run ./test_ck_tile_wg_32x16x128_fp4. ## Test Result ``` ./test_ck_tile_wg_32x16x128_fp4 [----------] Global test environment tear-down [==========] 3 tests from 1 test suite ran. (1751 ms total) [ PASSED ] 3 tests. ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-30 01:28:48 +00:00
Hosang Yoon	e7e8801dc3	[rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7) [CK_TILE] Use gfx11 float buffer atomics in FMHA Bwd ## Motivation FlashAttention CK backward on gfx11 can hit out-of-bounds/tail writes in the dQ accumulator atomic-add path when sequence rows are padded at the tile level but not marked invalid in the DQDKDV main tensor view. With the generic global atomic fallback, an incorrectly-valid tail element can issue an actual pointer-based `atomicAdd`. With the buffer atomic path, the write is issued through a buffer resource with bounds information and follows the same backend already used by gfx9/gfx12. This fixes the gfx11 FMHA BWD failure without changing the gfx11 default for unrelated CK Tile kernels. ## Technical Details This PR enables the existing CK Tile AMD buffer float atomic-add path only for generated FMHA BWD gfx11 translation units. gfx11 normally uses the generic global atomic fallback for floating-point `buffer_view::atomic_add`. That fallback performs the atomic through a raw computed pointer and depends on the software validity predicate to avoid invalid elements. In FMHA BWD dQ accumulation, padded tail rows can reach this path, so using the buffer atomic backend is safer: it uses a buffer resource with base pointer, bounds information, and an element offset, matching the backend already used by gfx9/gfx12. Enabling `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` globally for gfx11 is too broad and can break unrelated gfx11 CK builds such as GEMM. Instead, `config.hpp` now preserves an explicitly pre-defined `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT`, while keeping the existing default disabled for gfx11. ## Test Plan Validated the change with the FlashAttention CK full test suite with backward pass enabled on gfx11. pytest -q -s tests/test_flash_attn_ck.py ## Test Result FlashAttention CK gfx11 test result: 260680 passed, 152076 skipped ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-05-30 00:10:26 +00:00
Emily Martins	95c916369c	[rocm-libraries] ROCm/rocm-libraries#7584 (commit 060bad5) [CK_TILE] Fix Stream-K k_size calculation ## Motivation In a recent benchmarking task for CK Tile Stream-K algorithm, we identified that certain instances segfault. This change works to fix the bug and adds necessary regression tests. ## Technical Details The StreamK kernel constructs tensor views using a `k_size` parameter that determines how much of the K dimension to process in each iteration. Previously, this was calculated as: ```cpp index_t k_size = num_loop_sk * TilePartitioner::KPerBlock; ``` This calculation assumes all macro tiles along K are exactly `KPerBlock` in size. However, when `K % KPerBlock != 0`, the final macro tile along K has a remainder size of `K % KPerBlock`, not a full `KPerBlock` (see the figure below): <img width="961" height="488" alt="image" src="https://github.com/user-attachments/assets/3e1cceed-5dcd-4980-8b02-cee24eecf262" /> With the old code, a workgroup working with the `MPerBlock x (K % KPerBlock)` tile in A and B risk accessing illegal memory. Hence, this change ensures that when `K % KPerBlock != 0`, workgroups processing iterations that include the final macro-tile along K calculate the correct `k_size` based on the remainder rather than assuming a full `KPerBlock`. ## Test Plan I added the following tests: 1. Unit tests added for the Stream-K Tile Partitioner: - `StreamKTilePartitionerBaseGetKSize/NoRemainderTiles` - validates full tiles - `StreamKTilePartitionerBaseGetKSize/RemainderTiles` - validates remainder handling 2. Regression tests that test a case where `K % KPerBlock != 0` ## Test Result Tests passed locally on gfx90a, gfx942, and gfx950. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-29 21:36:49 +00:00
Aviral Goel	15c904b460	[rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a) ck_tile: add FillUniformScaleDistribution and fix MX GEMM scale init (#7724) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary ### Problem MX GEMM pipeline tests were passing vacuously: scale bytes were drawn from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷ — far below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so numerical checks passed without exercising the GEMM. ### Changes `include/ck_tile/host/fill.hpp` — new `FillUniformScaleDistribution<ScaleType>` functor - Accepts human-readable float bounds and maps them to the raw byte range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE 754 exponent into the type's bias space - Sampling is uniform over raw bytes → uniform over representable values - Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to avoid shifting negative signed integers (C++17 UB) - Adds `assert(min_r <= max_r)` to catch inverted-range UB when both bounds exceed the type's representable range - Provides default member values (0.125f, 2.0f) and `std::optional` seed consistent with sibling fillers - `/** /` Doxygen style with `@note` on snapping asymmetry `test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`* — fix scale initialization - Replace manual byte-range distribution with `FillUniformScaleDistribution<>{0.125f, 2.0f}` - Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid correlated scale tensors that were causing 60 test failures for fp4+e5m3/e4m3 combinations `test/ck_tile/utility/test_fill.cpp` — new unit tests for `FillUniformScaleDistribution` - 16 typed tests across e8m0, e4m3, e5m3: validity, range, reproducibility, coverage, snapping, stress, nullopt seed, and range overload - Test helper `expected_raw_range` mirrors implementation clamping exactly	2026-05-29 18:45:13 +00:00
Andriy Roshchenko	d5c9215064	[rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f) [CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC ## Motivation Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async load instruction for tensor A, which complements the existing MX GEMM FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines are enabled on gfx1250. ## Technical Details The existing gfx950 implementation was extended to support gfx1250 architecture. All three MX FP data types are supported across the two ASICs. It should be noted that while the TDM pipeline uses an emulated 32x32x128 warp-tile instruction, the present submission relies on the built-in 16x16x128 instruction, called 4 times per warp. ## Test Plan Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250 functionality. To help facilitate the testing in development, `example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was introduced to verify various combinations of supported data types and pipeline versions. ## Test Result The present submission is expected to work on both gfx950 and gfx1250 hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. - [x] Relies on #6978 and should only be merged after the changes are merged to the `develop`.	2026-05-29 17:02:45 +00:00
Bartłomiej Kocot	5d912538d3	[rocm-libraries] ROCm/rocm-libraries#7847 (commit b995ef2) [CK] Remove IsPackedTensor function ## Motivation Fix codegen hipRTC ## Technical Details Remove not needed function. Since MakeArgument supports long_index_t strides. ## Test Plan Codegen tests. ## Test Result Passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-29 14:00:06 +00:00
Ville Pietilä	78d657c4f7	[rocm-libraries] ROCm/rocm-libraries#7284 (commit e7d25b2) [CK_TILE] Integrate CK Tile Dispatcher code generation into CK Tile Profiler (#7284) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation CK Tile is going to be delivered to hipDNN via CK Dispatcher. Currently the CK Tile Profiler using CK Builder for generating the profiled instances from the configuration files that identify the instances that old CK exposes. We need to replace this instance generation with the CK Tile Dispatcher codegen. ## Technical Details The old CK Profiler config files are converted to JSON files that the CK Tile Dispatcher can digest. The conversion script for configurations is stored to source control in case we need to update the JSON configurations later. The dispatcher generates instance libraries per conv direction (fwd, bwd data, and bwd weight) that are linked to the CK Profiler executable. I also implemented codegne for the stream-K and depthwise conv instances. The proposed solution replaces the CK Builder codegen with the CK Tile Dispatcher codegen. There are two new methods that are exposed via the dispatcher backend - `is_supported` - required to enabled the profiler workflow where we check the applicability of the kernel instance before running it. - `get_instance_string` - this mainly for verification. This provide the CK Builder instance string for verifying that the old CK Builder based profiler and the new CK Tile Dispatcher based profiler have the same instances. The rules that limit the generated instances are now collected to a single location under the dispacther. The CK Builder codegen uses these, which ensures that the two codegen pipelines are in sync. The next step (different PR) is to remove the CK Builder codegen pipeline altogether. ## Test Plan Verified that the old CK Builder based profiler and the new CK Tile Dispatcher based profiler have the same instances, that is, the Dispatcher based codgen can generate the same instances as the old CK Builder. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-28 21:03:37 +00:00
ltqin	bf07a0150e	[rocm-libraries] ROCm/rocm-libraries#7723 (commit 4ed6c51) [CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels (#7723) ### Motivation The V3 pipeline (qr_async_trload_v3) for fp8bf16 FMHA kernels did not support LSE (Log-Sum-Exp) output. This PR enables LSE output support for fp8bf16 V3 FMHA kernels, allowing users to retrieve attention statistics alongside attention outputs. ### Technical Details - StandardAttention: lse = softmax_scale * m + log(l) - LogitsSoftCap: lse = (m / log2(e)) + log(l) ### Test Plan Run FMHA forward example with fp8bf16 precision and LSE output enabled: - Test 1: Basic LSE functionality ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1 - Test 2: LSE with LogitsSoftCap (CMakeList should remove Logits filter) ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1 -logits_soft_cap=30.0	2026-05-28 15:58:54 +00:00
Zoltán Lakatos	58e2ab1fc7	[rocm-libraries] ROCm/rocm-libraries#6761 (commit d19f6f1) [CK] Large tensor gemm workaround (#6761) ## Motivation Customer qeruested large tensor gemm support for 8bit and 4bit data types. Currently CK triggers “This GEMM not supported” error. The root cause appears to be the 2 GB limit on the input/output matrix, triggered by buffer offset constraints when testing a larger shape such as M = 699,904 (which is an exact multiple of MPerBlock = 256). ## Technical Details Quick workaround to have support ASAP. Split the tensors into inputs / outputs smaller than 2GB limit. Iterate on host and call all subproblems without device code change. Support is restricted to rowise layout in A, Ds and E All changes were implemented in DeviceGemm structures to avoid secondory affect on grouped convolutions. Got lots of AI generated comments. Addressed the ones that seemed relevant on the functionality. ## Test Plan Within CK the following examples can be used with modified input sizes: example_gemm_multiply_multiply_xdl_fp8 example_gemm_mx_fp4 Tested with Aiter tuning on provided shapes. ## Test Result All gemms run and provide correct results. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-05-27 18:55:15 +00:00
Michael Halkenhäuser	5dc8fbd1a8	[rocm-libraries] ROCm/rocm-libraries#6900 (commit `28608c2`) [CK] Fix and expand CK's commit records in version.h (#6900) ## Motivation In `version.h` of a CK installation `CK_COMMIT_ID` would be empty for out-of-source builds. Additionally, if it worked, it would show the parent repo's (`rocm-libraries`) commit. ## Technical Details Dropped "required" constraint so "unknown" string becomes a graceful option. Changed process of determining the CK commit, now uses `WORKING_DIRECTORY`. Thus, `CK_COMMIT_ID` holds only the last CK-relevant commit. Added `CK_PARENT_COMMIT_ID` which holds the parent's, e.g. `rocm-libraries`, commit. This can be the same as `CK_COMMIT_ID`, or not even applicable, depending on the scenario. ## Test Plan Ran CMake configuration and installation of CK to verify happy path. ## Test Result Commit SHA's showed the expected values depending on the repo state. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 10:17:02 -07:00

1 2 3 4 5 ...

1665 Commits