composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Aviral Goel	4aecc8de5b	[rocm-libraries] ROCm/rocm-libraries#7442 (commit b7d57ef) [CK] CompV4: remove redundant barrier (+5.7% gfx942, +1% gfx950) (#7442) ## Summary - Remove one redundant `block_sync_lds()` from the pong phase of the CompV4 GEMM pipeline hot loop - The pong phase had 2 barriers while ping had 1 — the second pong barrier (after LDS writes, before global loads) was unnecessary because the sync at the top of the next ping iteration already ensures LDS coherence - Removing this barrier allows global loads to overlap with LDS write drain, restoring the latency hiding the ping-pong design was built to provide - Abstracting away Ping Pong phases into generic lambda avoids making such mistake again. ## Benchmark ### gfx942 (MI300X), 86 fp16 GEMM shapes \| Metric \| Value \| \|---\|---\| \| Improved (>1%) \| 80 \| \| Neutral (±1%) \| 4 \| \| Regressed \| 2 \| \| Average gain \| +5.7% \| \| Best gain \| +18.0% (4096x256x16384) \| \| Worst regression \| -2.9% (12288x3072x4096) \| ### gfx950 (MI355X), 86 fp16 GEMM shapes \| Metric \| Value \| \|---\|---\| \| Improved (>1%) \| 32 \| \| Neutral (±1%) \| 54 \| \| Regressed \| 0 \| \| Best gain \| +9.0% (4096x2048x28672) \| ### Top gains by workload \| Shape (MxNxK) \| Source \| gfx942 BL \| gfx942 Opt \| gfx942 Gain \| gfx950 BL \| gfx950 Opt \| gfx950 Gain \| \|---\|---\|---\|---\|---\|---\|---\|---\| \| 4096x256x16384 \| bloom_fc2 \| 38.3 \| 45.2 \| +18.0% \| 75.6 \| 77.0 \| +1.9% \| \| 4096x512x22016 \| llama2_7b \| 77.8 \| 90.8 \| +16.7% \| 152.4 \| 154.9 \| +1.7% \| \| 256x1536x7168 \| deepseek \| 14.4 \| 16.7 \| +16.0% \| 27.2 \| 28.0 \| +2.8% \| \| 4096x1024x22016 \| llama2_7b \| 156.2 \| 180.8 \| +15.7% \| 304.8 \| 311.6 \| +2.2% \| \| 4096x1024x16384 \| bloom_fc2 \| 154.6 \| 178.5 \| +15.4% \| 303.1 \| 309.5 \| +2.1% \| \| 4096x4096x22016 \| llama2_7b \| 371.0 \| 412.3 \| +11.1% \| 819.8 \| 823.6 \| +0.5% \| \| 4096x2048x28672 \| llama3_8b \| 235.5 \| 259.5 \| +10.2% \| 530.0 \| 577.7 \| +9.0% \| \| 250880x256x4096 \| bloom_logits \| 289.0 \| 335.9 \| +16.2% \| 595.5 \| 599.1 \| +0.6% \| \| 8192x8192x8192 \| square \| 411.8 \| 432.9 \| +5.1% \| 825.1 \| 825.8 \| +0.1% \| \| 7168x4096x8192 \| llama70b \| 362.9 \| 374.7 \| +3.3% \| 775.8 \| 782.5 \| +0.9% \| ## Hardware counter analysis (rocprof-compute, 8192x8192x8192, gfx942) \| Metric \| Baseline \| Optimized \| Delta \| \|---\|---\|---\|---\| \| s_barrier per ping+pong \| 5 \| 4 \| -1 \| \| MFMA Utilization \| 47.8% \| 55.5% \| +7.7pp \| \| IPC \| 0.17 \| 0.21 \| +23.5% \| \| MFMA F16 % of peak \| 30.6% \| 33.5% \| +2.8pp \| \| VALU (instructions) \| 41.67M \| 41.67M \| identical \| \| MFMA (instructions) \| 65.91M \| 65.91M \| identical \| \| Spill/Stack Read \| 8.27M \| 8.27M \| identical \| All instruction counts are identical — the optimization removed one synchronization point, not any compute instructions. ## Correctness - gfx942: GPU verification (`-v=2`) passed on 4 shapes (8192x8192x8192, 4096x4096x4096, 22016x4096x4096, 4096x512x28672) - gfx950: GPU verification (`-v=2`) passed on all 86 shapes	2026-05-27 12:23:43 -04:00
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
assistant-librarian[bot]	6181eb2adf	[rocm-libraries] ROCm/rocm-libraries#4279 (commit 5b3f4b7) [CK_TILE] Stream-K XCD remapping (#4279) ## Proposed changes This PR adds support for XCD remapping as detailed in this [document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A). On gfx942, workgroups are typically scheduled round-robin across XCDs, which can lead to poor locality. We will use a remapping to assign workgroups to contiguous tiles in the XCDs improving the locality and the cache hit rate. This is done through a function that computes this contiguous mapping from this [PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have added to the StreamKTilePartitioner. This will require minimal changes to the Stream-K algorithm, only requiring a remap at the time the workgroups are partitioned. Through this approach we can improve the data locality by improving cache hits therefore closing performance gaps that are seen with the default scheduling. There have been unit tests added to verify the function in isolation. This is an optimization that is not specialized to just Stream-K GEMM and can be applied across GEMM. Note: This only applies to the gfx942 as they introduce the XCDs. Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652) 🧑‍💻 Originally authored by @arai713 --------- Co-authored-by: Astha <astha.rai713@gmail.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>	2026-05-26 09:43:03 -07:00
Anton Gorenko	66d6714376	[rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd) [CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388) ## Motivation Improve precision of mxfp4 without performance penalties. ## Technical Details Since performance of scale MFMAs is the same when neither A nor B is fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the second GEMM, while types of Q, K, V stay the same. This allows to improve overall precision significantly because fp6 has 32 non-negative values used for P quantization compared to just 8 values for fp4. It was found that there is a compiler bug with `__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in LCOMPILER-561) but a workaround seems to fix all failing instances. ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-26 06:55:17 -07:00
JP-Fernando	74bc86240b	[rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a) [CK Tile] Add gemm universal preshuffle to MX GEMM (#5647) ## Motivation Add gemm universal preshuffle support to existing MX GEMM pipeline. The straightforward way to do this is to port the `mx_flatmm` pipeline to the existing `gemm_mx` framework. ## Technical Details The `mx_flatmm` pipeline was not deleted, to allow for back-compatibility. ## Test Plan Add `preshuffle` option to example: `tile_example_mx_gemm`. Add new configurations with enabled preshuffle to the existing `test/ck_tile/gemm_mx` tests. ## Test Result Example and tests were successful on `gf950` architecture in the `Alola` cluster. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-05-22 16:07:53 +02:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
JiaLuo-CAN	5ff7497fa7	[rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4) [CK Tile] Fix Grouped Gemm quant mixed precision (#7537) <Migrate from Internal repo PR> test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8 cases: std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant, False, True, False>, std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant, False, True, False> GFX1250 would fail with incorrect results, GFX950 would fail when compiling BF8+FP8 and give incorrect results for FP8+BF8. The issue is due to the wrong ComputeDataType selection. The fix is to consider original ADataType and BDataType even when ComputeDataType is not void. For compiling error on gfx950, the bf8, fp8, 16x16x32 warp Gemm is added.	2026-05-21 08:36:23 -07:00
JP-Fernando	e7798e9560	[rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac) Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112) ## Motivation The goal of this work is to apply XOR shuffle (swizzle) to the current `comp_async` GEMM pipeline and the `gemm_mx` pipeline. XOR swizzling has been helpful to avoid LDS bank conflicts, as data are redistributed across LDS banks, such that simultaneous threads accessing different rows land on different LDS banks. ## Technical Details A similar approach to the work in the existing eight-waves pipeline was followed. Currently, XOR swizzle support is available for FP8 and BF8 types. FP4 support is also available for MX GEMM. Should the types not match, or should the async vector width be of an unsupported size, then the pipeline falls through to the previously existing ('unswizzled') path. ## Test Plan Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM pipeline. Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for the MX GEMM pipeline. ## Test Result The tests passed successfully in the `Alola` cluster with MI350 hardware. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-05-21 09:36:41 +02:00
Aviral Goel	458dd0ac4c	[rocm-libraries] ROCm/rocm-libraries#7130 (commit 9e1e065) [CK_TILE] Redesign LDS store API with pre-computed window coordinates (+15% MI355X, +6% MI300X) (#7130) ## Summary - Redesign the LDS store API to separate window creation from memory transfer - Add `MakeDistributedLdsStoreWindow` factory, `LocalStore` (fast path), and `LocalStoreWithCoordRecompute` (slow path) to the pipeline base class - Convert CompV3 as the reference implementation - Document the slow/fast path distinction across core tensor headers ## Motivation `LocalPrefill` hides a performance cliff: when given a bare `tile_window_with_static_lengths`, it silently reconstructs `tile_window_with_static_distribution` on every call — paying significant VALU overhead (~96 for typical configurations) for XOR coordinate computation. The cost is invisible at the call site. The new API makes the cost explicit via three verbs: \| Verb \| Method \| Cost \| When to use \| \|------\|--------\|------\|-------------\| \| Create \| `MakeDistributedLdsStoreWindow(bare, dstr)` \| VALU (once) \| Before hot loop, when VGPR budget allows \| \| Store (fast) \| `LocalStore(precomputed_window, tensor)` \| 0 VALU for coords \| Pre-computed window available \| \| Store (on-the-fly) \| `LocalStoreWithCoordRecompute(bare, tensor)` \| VALU per call \| VGPR budget tight, or one-shot stores \| Both `LocalStore` and `LocalStoreWithCoordRecompute` enforce correct window types via `static_assert`. `LocalPrefill` is retained for backward compatibility (69 call sites across 6 pipeline files). ## Performance ### 86 Shapes, CompV3_2 (128×128 tile), fp16, RCR layout gfx942 (MI300X): 86/86 improved, 0 regressions. Average gain: +6.2% gfx950 (MI355X): 85/86 improved, 1 neutral, 0 regressions. Average gain: ~+15% <img width="2777" height="1178" alt="pr7130_perf_chart" src="https://github.com/user-attachments/assets/b2f5c406-eb20-469d-8da6-dd608c28fbcc" /> \| Shape (MxNxK) \| Source \| gfx942 \| gfx950 \| \|---\|---\|---\|---\| \| 22016x256x4096 \| llama2_7b_fc1 \| +5.3% \| +11.4% \| \| 22016x512x4096 \| llama2_7b_pfill \| +5.9% \| +10.9% \| \| 4096x512x22016 \| llama2_7b_pfill \| +7.6% \| +28.5% \| \| 22016x1024x4096 \| llama2_7b_pfill \| +6.1% \| +10.1% \| \| 4096x1024x22016 \| llama2_7b_pfill \| +7.4% \| +17.2% \| \| 22016x4096x4096 \| llama2_7b_pfill \| +5.2% \| +9.3% \| \| 4096x4096x22016 \| llama2_7b_pfill \| +6.0% \| +9.3% \| \| 4096x4096x4096 \| llama2_7b_pfill \| +5.7% \| +10.6% \| \| 28672x256x4096 \| llama3_8b_fc1 \| +5.4% \| +12.2% \| \| 28672x512x4096 \| llama3_8b_pfill \| +4.9% \| +6.4% \| \| 4096x512x28672 \| llama3_8b_pfill \| +7.4% \| +1.5% \| \| 28672x2048x4096 \| llama3_8b_pfill \| +4.9% \| +8.6% \| \| 4096x2048x28672 \| llama3_8b_pfill \| +6.4% \| +8.4% \| \| 28672x8192x4096 \| llama3_8b_pfill \| +5.4% \| +8.0% \| \| 7168x1024x8192 \| llama70b_pfill \| +6.6% \| +10.8% \| \| 8192x1024x7168 \| llama70b_pfill \| +6.4% \| +11.4% \| \| 7168x4096x8192 \| llama70b_pfill \| +6.2% \| +9.6% \| \| 16384x256x4096 \| bloom_fc1 \| +6.4% \| +20.3% \| \| 16384x512x4096 \| bloom_fc1 \| +5.8% \| +8.5% \| \| 16384x1024x4096 \| bloom_fc1 \| +6.0% \| +10.9% \| \| 16384x2048x4096 \| bloom_fc1 \| +5.3% \| +10.1% \| \| 16384x3072x4096 \| bloom_fc1 \| +5.5% \| +8.8% \| \| 16384x4096x4096 \| bloom_fc1 \| +5.7% \| +8.8% \| \| 4096x256x16384 \| bloom_fc2 \| +7.8% \| +33.6% \| \| 4096x512x16384 \| bloom_fc2 \| +7.5% \| +31.6% \| \| 4096x1024x16384 \| bloom_fc2 \| +7.1% \| +17.1% \| \| 4096x2048x16384 \| bloom_fc2 \| +6.9% \| +11.0% \| \| 4096x3072x16384 \| bloom_fc2 \| +6.8% \| +11.0% \| \| 4096x4096x16384 \| bloom_fc2 \| +6.7% \| +10.3% \| \| 12288x256x4096 \| bloom_inproj \| +6.7% \| +22.0% \| \| 12288x512x4096 \| bloom_inproj \| +6.2% \| +9.8% \| \| 12288x1024x4096 \| bloom_inproj \| +5.9% \| +12.4% \| \| 12288x2048x4096 \| bloom_inproj \| +5.8% \| +10.1% \| \| 12288x3072x4096 \| bloom_inproj \| +5.4% \| +10.1% \| \| 12288x4096x4096 \| bloom_inproj \| +5.7% \| +9.1% \| \| 250880x256x4096 \| bloom_logits \| +2.6% \| +0.5% \| \| 4096x256x4096 \| bloom_outproj \| +7.1% \| +28.4% \| \| 4096x512x4096 \| bloom_outproj \| +6.8% \| +27.4% \| \| 4096x1024x4096 \| bloom_outproj \| +6.5% \| +21.3% \| \| 4096x2048x4096 \| bloom_outproj \| +5.9% \| +13.1% \| \| 4096x3072x4096 \| bloom_outproj \| +5.9% \| +12.0% \| \| 16x1536x7168 \| deepseek \| +7.7% \| +34.7% \| \| 32x1536x7168 \| deepseek \| +7.7% \| +34.9% \| \| 64x1536x7168 \| deepseek \| +7.6% \| +31.3% \| \| 128x1536x7168 \| deepseek \| +7.6% \| +25.8% \| \| 256x1536x7168 \| deepseek \| +7.7% \| +27.9% \| \| 512x1536x7168 \| deepseek \| +7.6% \| +29.1% \| \| 1024x1536x7168 \| deepseek \| +7.3% \| +28.8% \| \| 2048x1536x7168 \| deepseek \| +6.9% \| +20.5% \| \| 4096x1536x7168 \| deepseek \| +6.3% \| +11.0% \| \| 8192x1536x7168 \| deepseek \| +6.2% \| +11.3% \| \| 16384x1536x7168 \| deepseek \| +6.0% \| +9.1% \| \| 20480x1536x7168 \| deepseek \| +4.8% \| +9.3% \| \| 16x3072x1536 \| deepseek \| +6.3% \| +25.1% \| \| 32x3072x1536 \| deepseek \| +6.4% \| +25.3% \| \| 64x3072x1536 \| deepseek \| +6.4% \| +24.8% \| \| 1024x1024x1024 \| square \| +5.5% \| +18.7% \| \| 2048x2048x2048 \| square \| +6.0% \| +19.2% \| \| 3584x3584x3584 \| square \| +5.3% \| +11.2% \| \| 5120x5120x5120 \| square \| +6.1% \| +10.0% \| \| 6144x6144x6144 \| square \| +5.5% \| +9.8% \| \| 8192x8192x8192 \| square \| +6.0% \| +8.2% \| \| 1024x4608x1024 \| midsize \| +4.6% \| +4.6% \| \| 512x18432x512 \| midsize \| +1.9% \| +10.1% \| \| 4096x18432x4096 \| midsize \| +5.8% \| +8.8% \| \| 320x8192x320 \| stablediff \| +4.0% \| +11.3% \| \| 640x2048x640 \| stablediff \| +4.5% \| +14.0% \| \| 320x8192x1280 \| stablediff \| +5.6% \| +20.1% \| \| 1x1280x8192 \| skinny_m1 \| +7.7% \| +35.3% \| \| 1x8192x1024 \| skinny_m1 \| +6.0% \| +20.3% \| \| 1x7168x8192 \| skinny_m1 \| +7.7% \| +36.6% \| \| 1x8192x3584 \| skinny_m1 \| +7.3% \| +27.9% \| \| 1x13312x6656 \| skinny_m1 \| +7.6% \| +30.3% \| \| 1x13312x16384 \| skinny_m1 \| +7.8% \| +4.2% \| \| 1x16384x6656 \| skinny_m1 \| +7.5% \| +28.7% \| \| 1x16384x16384 \| skinny_m1 \| +7.7% \| +2.3% \| \| 16x4096x4096 \| skinny_m16 \| +7.4% \| +31.9% \| \| 16x22016x4096 \| skinny_m16 \| +7.5% \| +26.5% \| \| 16x28672x4096 \| skinny_m16 \| +7.0% \| +15.1% \| \| 16384x1280x8192 \| skinny_m16 \| +5.6% \| +8.7% \| \| 16384x8192x1024 \| skinny_m16 \| +4.5% \| +8.8% \| \| 2048x4096x2048 \| mixed \| +4.7% \| +9.0% \| \| 4096x2048x8192 \| mixed \| +6.8% \| +11.0% \| \| 8192x4096x4096 \| mixed \| +5.2% \| +10.0% \| \| 1x4096x4096 \| mixed \| +7.4% \| +32.4% \| \| 1024x1024x4096 \| mixed \| +7.1% \| +27.4% \| ### ISA Hot Loop Diff (LBB1_32, per K-iteration, gfx942) \| Metric \| Baseline \| Optimized \| Delta \| \|--------\|----------\|-----------\|-------\| \| Total VALU \| 621 \| 500 \| -121 \| \| VGPR / SGPR \| 512 / 96 \| 512 / 96 \| unchanged \| ### Hardware Counters — Instruction Mix (gfx950, rocprofiler-compute) Profiled on MI350X, shape 4096×256×16384 (bloom_fc2). Instruction counts are deterministic hardware counters. \| Metric \| Baseline \| Optimized \| Δ \| \|--------\|----------\|-----------\|---\| \| VALU instructions/kernel \| 4,642,473 \| 987,958 \| −78.7% \| \| INT32 VALU \| 2,592,786 \| 541,129 \| −79.1% \| \| Instructions / wavefront \| 39,178 \| 24,400 \| −37.7% \| \| VGPRs (avg) \| 98 \| 90 \| −8% \| \| MFMA instructions \| 2,059,702 \| 2,059,702 \| 0% \| \| LDS instructions \| 1,564,891 \| 1,564,891 \| 0% \| \| VMEM instructions \| 520,996 \| 520,996 \| 0% \| MFMA as fraction of total instructions: 30.7% → 67.5%. Eliminating ~3.65M redundant INT32 VALU instructions (XOR coordinate recomputation per K-iteration) leaves the scheduler more headroom for MFMA dispatch, directly explaining the benchmark gains.	2026-05-19 20:22:37 -04:00
Enrico Degregori	9565ca21ec	[rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2) [CK Tile] Eight Waves pipeline for MX GEMM (#5552) ## Motivation Integrate Eight Waves pipeline in MX GEMM ## Technical Details - EightWaves pipeline: - Add pipeline, policy and block gemm (internally using existing implementation used by GEMM and ABQuant) - Extend support of EightWaves policy for FP4 (packed types) - Async pipeline: - Fix pipeline with packed scales (requires MRepeat and NRepeat to be contiguous) - block gemm specific for MX GEMM is defined because distribution encodings have changed - CShuffle: - Add new functionality to support MRepeat and NRepeat contiguous (defined by `TilesPacked`) - Examples: - Refactor examples to easily switch different configurations (similar to GEMM universal) - Scales values generated consistently with other microscale implementations in CK Tile - Add configuration for EightWaves pipeline - Tests: - Unify existing FP8 and FP4 tests - Add tests for EightWaves pipeline - Scales values generated consistently with other microscale implementations in CK Tile Note: FP6 support for MX GEMM was added later and the support for the Eight Waves pipeline will be done in following PR ## Test Plan Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and FP8 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-19 11:53:19 -07:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
ltqin	501e7ef12a	[rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057) [CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574) ## Summary Add a CK_TILE forward kernel implementing [SageAttention v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that applies multi-granularity quantization to Q/K/V before computing attention, trading minimal accuracy loss for higher throughput on low-precision hardware. ### Quantization design \| Tensor \| Supported data types \| Scale granularity options \| \|--------\|---------------------\|--------------------------\| \| Q \| fp8 / int8 / int4 \| per-tensor, per-block (128 tokens), per-warp (32 tokens), per-thread (4 tokens) \| \| K \| fp8 / int8 / int4 \| per-tensor, per-block (128 tokens), per-warp (64 tokens), per-thread (16 tokens) \| \| V \| fp8 \| per-channel (always) \| \| O \| bf16 \| — \| Three precision combinations are supported: `fp8/bf16` (QKV fp8, O bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK int4, V fp8, O bf16). ### Architecture support - gfx9 (CDNA2/3, e.g. gfx90a, gfx942) — full tile set - gfx950 (CDNA4) — restricted tile set (N-per-block capped at 64 for fp8-family dtypes) ### Implementation - Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC` (async copy) - Masking support: no mask, causal (top-left / bottom-right), and generic windowed - Batch and group (variable-length) modes - Head dimension: d=128, d_v=128 - Python codegen under `example/ck_tile/49_sageattention/codegen/` generates kernel instances per target/dtype/tile combination - Smoke tests included via `tile_example_sageattn_fwd` ### Test commands \`\`\`bash # fp8 QKV ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=fp8bf16 -qscale=3 -init=3 # int8 QK, fp8 V ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=i8fp8bf16 -qscale=3 -init=3 \`\`\` \`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread	2026-04-30 11:32:23 -07:00
Qianfeng	0e6a514e4f	[rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e) Improve the performance of qr_ks_vs_whole_k_prefetch pipeline (#6209) ## About qr_ks_vs_whole_k_prefetch pipeline This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on both MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where the number of workgroups is low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles. ## Changes: - Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to avoid using shuffle instructions on MI350 - Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate `clear_tile` calls - Adds full support for hdim96/hdim160 without padding requirements - Updates MFMA instruction selection to ensure optimal choices for MI350 ## Performance results 1. For attention shapes which leads to kM0=64, `qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms` by whole_k_prefetch_trload & `58.50ms` by async_load), and `qr_ks_vs_async_whole_k_prefetch_trload` also shows obviously better performance than the recently tuned `qr_ks_vs_async` on the same case (execution time `41.02ms` by whole_k_prefetch_trload 7 `47.60ms` by qr_ks_vs_async) 2. Also on MI300, for attention shapes which leads to kM0=64, `qr_ks_vs_async_whole_k_prefetch` shows much better performance than the `qr_ks_vs_async` (which is supposed to be very high-efficient) on the same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by qr_ks_vs_async) 3. For attention shapes which leads to kM0=128, `qr_ks_vs_async_whole_k_prefetch_trload` show a little bit better performance than `qr_ks_vs_async` on mi350 (execution time `104.50ms` by whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async). And they shows completely on-par performance on MI300 ## Test/Verify 1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to test/verify qr_ks_vs_whole_k_prefetch pipeline since this pipeline can not be used by ck_tile fmha example so far 2. Use the following command-line for building/testing xformers >```bash > #> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers > #> git submodule update --init --recursive > #> pip install --no-build-isolation -e ./ > #> pytest tests/test_mem_eff_attention.py::test_forward >``` 4. Any scripts which can run on xformers can be used to evaluate qr_ks_vs_whole_k_prefetch pipeline. Using the two environ variable to switch from using different pipelines > ```bash > #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3 and qr_ks_vs_async_trload pipeline > #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using qr_ks_vs_async pipeline for comparing > ``` ## Discussion --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com> Co-authored-by: qianfengz <12429178+qianfengz@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-04-24 10:30:41 -06:00
Illia Silin	d16061f578	[rocm-libraries] ROCm/rocm-libraries#6550 (commit c396de9) [CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
Sami Remes	de3fa71992	[rocm-libraries] ROCm/rocm-libraries#6611 (commit 5375c0f) [CK_TILE] Preserve input strides in EightWaves async-load descriptor (#6611) `MakeAsyncLoadADramWindow` in `GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes strides from lengths and assumes a dense layout. When the input view's leading-dim stride is larger than its inner length (non-packed memory layout), the resulting tile window stepped through memory at the wrong stride. Compose the unmerge transforms on top of the input view's existing descriptor instead, so the actual runtime strides are preserved and the correct `element_space_size` is inherited for bounds checking. ## Test Plan Added an unit test showing the problem. ## Test Result The new test fails before fixes and passes after. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 12:52:02 +02:00
Hosang Yoon	7e4e291771	[rocm-libraries] ROCm/rocm-libraries#6450 (commit b75fed1) [CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450) ## Motivation `qr_hpad` currently executes work for padded head-dim fragments even when only a subset of the values are valid. This adds unnecessary computation for head dimensions that require padding, such as `hdim=72` and `hdim=80`, and hurts FMHA forward performance. The goal of this PR is to make the padded-head-dim path skip invalid work based on the actual valid fragment count, while preserving the existing behavior for the non-padded path. ## Technical Details This PR improves the `qr_hpad` FMHA forward path in three parts: - Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part of the fragment is valid. - Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the valid fragment range instead of always computing over the padded extent. - Retune the gfx11 `qr_hpad` kernel configuration after enabling the partial-fragment path. To keep the existing path stable, the implementation adds overloads for the updated GEMM/pipeline interfaces. This allows existing full-tile callers to keep using the previous form, while the `qr_hpad` path can pass valid fragment counts when needed. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - On gfx11 and gfx12, for head dimensions that require padding, `tile_example_fmha_fwd` shows about 20-30% performance improvement at `hdim=72/80`. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-18 06:44:46 +00:00
chris-tsiaousis-hpc	1c04029c0e	[rocm-libraries] ROCm/rocm-libraries#5508 (commit 0ad0aca) [CK Tile] Unification work - mma transformations pipeline (#5508) ## Motivation In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing. For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform. A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR. The pipeline base class is basically at the RFC stage. We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline. ## Technical Details The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed. ## Test Plan New tests are added. ## Test Result Tests should pass. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-04-14 09:25:01 +02:00
Aviral Goel	c7eb33078c	[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 11:17:11 -04:00
Christopher Millette	c6e5cd65d3	[rocm-libraries] ROCm/rocm-libraries#5939 (commit 6fb1791) [CK_TILE] Flatten nested static_for loops into static_ford (#5939) ## Summary Mechanical conversion of 129 nested `static_for`/`static_ford` patterns to flat `static_ford` across 29 ck_tile header files. Each conversion eliminates intermediate lambda closure instantiations by replacing nested compile-time loops with a single flat iteration using index decomposition. ### What `static_ford` eliminates When `static_for` loops are nested, each level creates unique closure types: ```cpp // BEFORE: M + M×N = 20 IR functions (for M=4, N=4) static_for<0, 4, 1>{}([&](auto m) { // 4 closure instantiations static_for<0, 4, 1>{}([&](auto n) { // 4×4 = 16 closure instantiations body(m, n); }); }); // AFTER: M×N = 16 IR functions (with ford_applier, no intermediates) static_ford<sequence<4, 4>>{}([&](auto mn) { constexpr auto m = number<mn[number<0>{}]>{}; constexpr auto n = number<mn[number<1>{}]>{}; body(m, n); }); ``` ### Pattern categories converted \| Category \| Count \| Description \| \|----------\|-------\|-------------\| \| C (2-level `static_for` chains) \| 112 \| Nested `static_for` → `static_ford` \| \| C3 (3-level `static_for` chains) \| 9 \| Three consecutive nests → `static_ford` \| \| Partial rescue \| 3 \| Outer 2 levels of blocked 4-level nests \| \| B (nested `static_ford` merge) \| 5 \| Two nested `static_ford` → single higher-dim `static_ford` \| \| Total \| 129 \| Across 29 files \| 6 false positives were detected and reverted (in `tensor_adaptor.hpp`, `tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the inner loop bound depended on the outer variable. ### Files changed by family \| Family \| Files \| Sites \| \|--------\|-------\|-------\| \| Block GEMM \| 12 \| ~20 \| \| FlatMM pipelines \| 4 \| ~69 (including 5 ford-ford merges) \| \| GEMM quant \| 7 \| ~22 \| \| FlatMM kernel \| 1 \| 2 \| \| FMHA \| 1 \| 2 \| \| Reduce/norm \| 2 \| 2 \| \| Epilogue \| 1 \| 1 \| ### Blocked locations from review comments - block_gemm_areg_breg_creg_v1.hpp:356 — BLOCKED: runtime scale loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between every nesting level - block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228 — BLOCKED: `zero_accumulators()` before inner loop; `sched_barrier` + conditional `block_sync_lds()` after inner loop - block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298 — BLOCKED: runtime `CWarpTensor` construction before inner loop; quantization scale application code after inner loop - block_universal_gemm_as_aquant_bs_cr.hpp:277 — BLOCKED: same pattern as above - block_universal_gemm_as_bs_bquant_cr.hpp:367 — BLOCKED: same pattern as above ## Depends on - #5938 ([CK_TILE] Optimize static_ford and sequence compile-time infrastructure) — provides the `ford_applier` that makes these conversions beneficial. Without it, `static_ford` uses a recursive implementation that provides no IR function savings. ## Results (combined with #5938) ### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942) \| Target \| Base (s) \| Treat (s) \| Delta \| % \| Significant? \| \|--------\|----------\|-----------\|-------\|---\|-------------\| \| flatmm \| 161.1 \| 149.0 \| -12.1s \| -7.5% \| YES (p<0.01, 7/7 wins) \| \| universal_gemm \| 225.4 \| 220.3 \| -5.1s \| -2.3% \| YES (p<0.01, 7/7 wins) \| ### IR Function Counts (device trace, gfx942) \| Target \| InstFunc \| CodeGen \| \|--------\|----------\|---------\| \| universal_gemm \| -8.5% \| -9.2% \| \| flatmm \| -7.6% \| -10.5% \| ### ASM Equivalence 5/5 PASS — 650,151 lines verified identical (gfx942). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale. ## Test plan - [x] ASM equivalence verified (650K lines, gfx942) - [x] Wilcoxon timing verified (7 trials, p<0.01) - [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction) - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-04-07 08:36:45 -06:00
Po Yen Chen	bf736dfa74	[rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2) [CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051) ## Motivation The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path. ## Technical Details Warp GEMM: - Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries V3 Kernel (`fmha_fwd_v3_kernel.hpp`): - Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`): - Add FP8 data path with dtype-aware type selection - Add asm volatile P matrix conversion from f32 to fp8 - Add FP8-aware instruction scheduling in `CoreLoopScheduler` V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`): - Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility) Codegen (`fmha_fwd.py`): - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128) - Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor) - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor Misc: - Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`) - Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`) V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation. ## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950) \| Problem \| Regular (TFlops) \| Varlen (TFlops) \| \|---\|---:\|---:\| \| batch=1 heads=6/1 seqlen=1024 causal \| 48.9 \| 47.6 \| \| batch=1 heads=6/1 seqlen=2048 causal \| 119.8 \| 117.4 \| \| batch=1 heads=6/1 seqlen=4096 causal \| 263.7 \| 259.2 \| \| batch=1 heads=6/1 seqlen=8192 causal \| 548.9 \| 543.6 \| \| batch=1 heads=6/1 seqlen=16384 causal \| 1043.0 \| 1063.7 \| \| batch=1 heads=6/1 seqlen=32768 causal \| 1237.2 \| 1279.6 \| \| batch=1 heads=6/1 seqlen=65536 causal \| 1315.4 \| 1382.7 \| \| batch=1 heads=6/1 seqlen=131072 causal \| 1326.3 \| 1402.2 \| \| batch=1 heads=16/1 seqlen=65536 causal \| 1298.7 \| 1388.4 \| \| batch=1 heads=40/40 seqlen=37200 non-causal \| 1248.9 \| 1326.1 \| ## Test Plan Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`. ## Test Result 176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-07 22:19:28 +08:00
Chinmay Dattanand Kuchinad	9a49642035	[rocm-libraries] ROCm/rocm-libraries#5776 (commit ee1bbcb) [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776) ## Motivation Fix pivot mismatch in the persistent GEMM kernel's async input scheduler that causes GPU hangs and incorrect results when used with AsyncTP (Asynchronous Tensor Parallelism) on ROCm. PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM kernel with chunk signals to overlap communication and computation. The pivot mechanism ensures each rank starts computing from its own local shard first (which is already available), then moves to remote chunks as they arrive over the network. Because of the pivot mismatch, the kernel frequently waits on signals for chunks that have not yet arrived, while attempting to read data from completely different chunks. This synchronization desync reliably triggers infinite hangs during multi-GPU native AsyncTP execution. This fix is required to enable functional AsyncTP support on ROCm. ## Technical Details In the persistent kernel loop (`UniversalGemmKernel::operator()`), the M-tile coordinate used for data selection (`i_m`) and the M-tile coordinate used for the chunk-signal wait (`chunk_idx`) were derived from inconsistent bases: * `i_m` was computed from the unpivoted tile index `iM`. * `chunk_idx` was computed from the pivoted expression `(iM + tile_idx_pivot)`. This means the kernel could wait for chunk N's signal but then read from chunk M's memory, or vice versa. The mismatch scales with GPU count: with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc. The Fix: Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m` and `chunk_idx` from it. This guarantees the kernel always waits for the correct chunk before reading its data. (Note: Minor cosmetic `clang-format` changes were also pulled in alongside the fix). ## Test Plan 1. Build PyTorch with this CK change. 2. Run the specific multi-GPU AsyncTP native test: `timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest test/distributed/test_symmetric_memory.py -k test_fused_all_gather_matmul_native -q -s -x` ## Test Result Tests verify correct overlapping execution without hangs or accuracy mismatches when running the AsyncTP native path with non-zero pivots. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-04-01 09:21:20 -07:00
Bartłomiej Kocot	1f30eb4a54	[rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690) [CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842) ## Motivation Force padding for atomic_add bf16 C tensor to avoid memfaults. ## Technical Details - add global atomic add for bf16 and enable them - add padding for atomic add bf16 due to the lack of oob - remove padding for not continous dims in conv for other cases - minor bwd data conv fixes ## Test Plan test_grouped_conv_*_tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 10:02:24 +02:00
Linjun-AMD	7469320248	[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2) [CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849) This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 20:36:39 +00:00
Johannes Graner	d219484b46	[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649) [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 10:17:10 +01:00
joyeamd	cb76c8b8e2	[rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6) [CK][CK_TILE] Revert addional oob check in gemm IsSupported function (#5789) ## Motivation fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-26 09:40:44 +08:00
joyeamd	9abcf2bfa2	[rocm-libraries] ROCm/rocm-libraries#5697 (commit dd1c396) Revert "Ck/joye/revert oob check (#5640)" (#5697) This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 22:04:14 +00:00
Bartłomiej Kocot	6e03aaa690	[rocm-libraries] ROCm/rocm-libraries#5555 (commit 1d2c4c8) [CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555) ## Motivation Fix kbatch check in grouped conv and gemm kernels, allow tails for kbatch. ## Technical Details Round up K / Kperxdl and divide it by Kbatch to allow tail for K. ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-21 23:55:24 +01:00
joyeamd	9da28b776b	[rocm-libraries] ROCm/rocm-libraries#5640 (commit 552ab48) Ck/joye/revert oob check (#5640) ## Motivation fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 12:30:08 +00:00
arai713	8cc3cc5578	[rocm-libraries] ROCm/rocm-libraries#4795 (commit 6590a1a) [CK_TILE] Rename Stream-K grid function (#4795) ## Motivation This PR introduces a change in the name of the get_grid function in the Stream-K TilePartitioner to avoid confusion with a similarly named method. In the Stream-K TilePartitioner, there is get_grid() which returns num_cu*occupancy and there is grid_size() which returns the grid size used to launch the kernel. In this PR, we change get_grid() to be get_max_active_wgs() to better reflect what the function returns and not confuse it with grid_size(). ## Technical Details Initially in the Stream-K TilePartitioner we had get_grid() which returned grid_. We are renaming get_grid() to get_max_active_wgs() and grid_ to max_active_wgs_ internally, while keeping grid_size() the same. The parameter, grid, for the Stream-K TilePartitioner remains the same to maintain consistency with the rest of the Stream-K API. ## Test Plan Validated using the test suite that is already present. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 03:27:44 -06:00
Sami Remes	8f0ede3ea2	[rocm-libraries] ROCm/rocm-libraries#5095 (commit 7e55766) [CK_TILE] Enable MXFP6 for MX GEMM op (#5095) ## Motivation Add support for MXFP6 in the MX GEMM op in CK-Tile. Depends on https://github.com/ROCm/rocm-libraries/pull/4594 ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-19 18:07:47 -07:00
Bartłomiej Kocot	e92c207845	[rocm-libraries] ROCm/rocm-libraries#5470 (commit fe3405d) [CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470) ## Motivation Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile window. ## Technical Details - Fix dram step for KM/KN layouts in V1 pipeline - Disable instances which use more threads than warp size in continous dim (not supported in ck tile yet) - Use 1x1 specialization for explicit gemm - Use two stage for vectorsize =1 and sizeof(datatype) ==2 - remove not needed check sinze GetVectorSizeA/B check if vector size is fixed ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-966	2026-03-19 11:59:44 +00:00
assistant-librarian[bot]	544a3182c1	[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a) [CK_TILE] add tf32 support (#4302) ## Proposed changes TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [ ] Any dependent changes have been merged ## Discussion --- 🔁 Imported from [ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538) 🧑‍💻 Originally authored by @yingluAMD --------- Co-authored-by: yingluAMD <Yingmao.Lu@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-19 10:17:20 +01:00
Thomas Ning	1ab29bf22f	[rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e) CK Tile MX GEMM Packing Improvement (#5323) ## Motivation Reduce the scale loading size and also has better utilization of MFMA scale selection. ## Technical Details Add up the packing of mx scales. ## Test Plan Use the existing test cases. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: Enrico Degregori <enrico@streamhpc.com>	2026-03-17 11:57:32 -07:00
Enrico Degregori	7005afd86e	[rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a) [CK Tile] Eight Waves pipeline GEMM (#4964) ## Motivation Eight waves pipeline was added for ABQuant. The goal of this PR is to enable it also for GEMM ## Technical Details Summary: - Block: - Create block struct for GEMM using eight warps specific distribution encodings - Use this block struct in ABQuant for encodings - Pipeline: - Create impl pipeline for eight waves which can be used by GEMM and ABQuant as base (and for AQuant and BQuant in the future) - Create eight waves pipeline for GEMM (this can not be easily integrated in the existing async pipeline) - Pipeline policy: - Extract GEMM specific parts in the ABQuant policy to define GEMM policy (then ABQuant use it as base and add Quant specific methods) - Minor: naming was inconsistent between warp/wave, everything is now referred to as eight waves So overall we have: - block struct directly used by GEMM -> ABQuant derived struct to implement operator - Impl base pipeline with general implementation -> GEMM and ABQuant pipelines use it to avoid code duplication but still define their own pipelines - pipeline policy struct directly used by GEMM -> ABQuant derived policy struct for Quant specific parts ## Test Plan Added new tests for GEMM pipeline: `test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950 supports it). Note: K padding test is disabled for this pipeline because it's not implemented yet ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 09:30:54 +01:00
Adam Osewski	e4b68f386f	[rocm-libraries] ROCm/rocm-libraries#5050 (commit 033dad7) [CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050) ## Motivation It's common in MoE workloads that some experts receive zero tokens, which would result in some of the dimensions equal to zero. Currently we handle such case only for non-persistent kernels where we have all GEMMs information beforehand on host - we validate this during creation of kernel arguments. However for the "dynamic" input path (persistent kernel) this information is not available before kernel launch. Thus we have to validate this during kernel execution. The goal is to add this validation. ## Technical Details Skip work if any of Grouped GEMM groups M/N/K are zero for persistent kernel path. ## Test Plan Add unit-tests which cover "dynamic" inputs with zero dims for persistent kernel execution path. ## Test Result All tests pass. ## Submission Checklist - [ x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 13:28:24 +00:00
chris-tsiaousis-hpc	f43b7421c7	[rocm-libraries] ROCm/rocm-libraries#5241 (commit 43daeac) Changed the include order of the new WMMA/MFMA unification framework (#5241) Those changes are to fix the include order and make header files independent of one another. Also the `remod.py` sript has run and changed the `grouped_convolution.hpp` and `core.hpp` files. ## Motivation Some headers appear to depend on include order. For example, when moving `#include "wmma/wmma.hpp"` in [amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp) later in the include list, it is causing compilation errors. Also the pre-commit script `remod.py` is shuffling includes to be in alphabetical order and is causing compilation issues. Expected behaviour: Headers should be independent of one another: no header should require another to be included first. Each header should compile correctly on its own. ## Test Plan The CI (that runs `remod.py`) should compile. ## Test Result Existing CI should compile and be green. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-03-12 09:26:58 +01:00
Anton Gorenko	5f94c1aa0f	[rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc) [CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 09:59:50 +00:00
Sami Remes	c1525b3f30	[rocm-libraries] ROCm/rocm-libraries#4594 (commit 1fce4cb) [CK_TILE] MX GEMM non-preshuffled RCR layout (#4594) ## Motivation Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled layouts using async pipeline. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2026-03-10 20:12:05 +00:00
Bartłomiej Kocot	e262252c4c	[rocm-libraries] ROCm/rocm-libraries#5115 (commit a21861e) [CK][CK Tile] Add grouped conv backward weight tile test and fix tr load in BASE_V1 pipeline (#5115) ## Motivation Test grouped conv backward weight from ck tile and fix incorrect values. ## Technical Details - Add test for CI - Add daily tests - Fix transpose load in BASE_V1 pipeline ## Test Plan test_grouped_convnd_backward_weight_tile ## Test Result in progress ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-783	2026-03-10 03:03:04 +00:00
chris-tsiaousis-hpc	fe128b581f	[rocm-libraries] ROCm/rocm-libraries#4837 (commit 6316035) [CK TILE] Unification of sparse MFMA/WMMA policy structs (#4837) ## Motivation The existing unification work supports DENSE intrinsics. In this PR we enable support for SPARSE as well as SCALE intrinsics and add an example SPARSE implementation. ## Technical Details Mostly trivial changes. One framework change is that the desired `MmaOpFamily` is passed to the `MmaDefaultSelector`. As my relevant commit explains, we do not support a fallback family at the moment, but it is something we can consider. ## Test Plan Added a new test for the relevant sparse specializations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-03-05 19:52:04 +00:00
Ville Pietilä	68b0f420ba	[rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400) [CK_TILE] Add CK Tile bwd weight profiler (#4797) ## Motivation To compare old CK and CK Tile, we need to extend the current CK profiler to support running also CK Tile instance with the same API. In order to have the same instance coverage in CK Tile compared to the old CK, I've added code generation from old CK configurations to CK Tile instances using the CK Builder. ## Technical Details - The codegen python script for CK Tile fwd convs is extended to support also bwd weight and bwd data. - The generated instances are added to the CMake build (target `device_grouped_conv_bwd_weight_tile_instance`s). - A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to the CK Profiler. --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2026-03-04 21:49:42 +00:00
Emily Martins	11fa35bea6	[rocm-libraries] ROCm/rocm-libraries#4984 (commit 962b047) [CK_TILE] Reduce Register Spills in Stream-K Reductions (#4984) ## Motivation In CK Tile Stream-K, kernels using one of two non-atomic reduction strategies (i.e., linear, tree) have high register spill count, with the tree reduction generally being worse. These changes act a first step to help decrease the register spill count. ## Technical Details ### Problem 1: Unvectorized access to partials In both the linear and tree reductions, workgroups write partials results to a global buffer; another workgroup will later read this data. When the initial logic to support reading and writing to the partials buffer was added (see https://github.com/ROCm/composable_kernel/pull/3107), the tile distribution encoding used to read from and write to partials matches the register layout for the accumulator of the mfma instruction used for the kernel. Since we do not currently use the transposed register layout for the accumulator, we end with an encoding that is not optimized for writing to HBM. For example: Consider the register layout of the `v_mfma_f32_16x16x32_fp8_fp8` instruction. ```bash ./matrix_calculator.py --architecture gfx942 --instruction v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix ``` <img width="1113" height="537" alt="image" src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02" /> The above shows that threads are responsible for consecutive elements down a column of the C tile. If we use this distribution to read and write to partials with C in row major, then threads are unable to perform vectorized reads and writes. Note: thread 0 is shown in red and thread 1 is shown in green. Since the C-shuffle Epilogue only supports C in row major, reading and writing to partials is highly unoptimized. ### Problem 2: Missed opportunity for SPGR use in tree reduction loop Since the reduction occurs between workgroups, all threads in the workgroup follow the same execution paths in the tree reduction logic, hence various variables should be using SGPRs, but they are not. ### Implemented Solutions 1. Add a new tile distribution encoding that is optimized for accessing partials in HBM. This encoding does not change the data assignment to threads, it merely changes the addresses to which they write/read in the partials buffer. For example, continuing with the `v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result in threads writing in the following layout: <img width="517" height="342" alt="image" src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202" /> This layout ensures that each thread writes along a row, enabling `buffer_{store\|load}_dwordx4` instructions (i.e., vectorized accesses). This helps reduce register usage due to requiring fewer offset calculations. 2. To force SGPR usage in the tree reduction loop, I make use of CK Tile's `amd_wave_read_first_lane` which is a wrapper around `__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the tree reduction. _These changes do not fully eliminate register spills. Future work will aim to further reduce spills. But these changes make good progress._ ## Test Plan Added tests for different warp tile sizes to validate that the new encoding works with different `WarpGemm` variants. ## Test Result All tests pass locally on all gfx9 architectures. Some results for decreases in register spills on gfx942: (BL = baseline) \| Kernel \| SGPR Spill (BL) \| SGPR Spill (new) \| SGPR Delta \| SGPR % \| VGPR Spill (BL) \| VGPR Spill (new) \| VGPR Delta \| VGPR % \| \|--------\|------------------:\|------------------:\|-----------:\|-------:\|-------------------:\|------------------:\|-----------:\|-------:\| \| fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 \| 223 \| 0 \| -223 \| -100.0% \| 21 \| 20 \| -1 \| -4.8% \| \| fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 \| 233 \| 11 \| -222 \| -95.3% \| 443 \| 23 \| -420 \| -94.8% \| \| fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 \| 221 \| 3 \| -218 \| -98.6% \| 12 \| 6 \| -6 \| -50.0% \| \| fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 \| 230 \| 14 \| -216 \| -93.9% \| 396 \| 12 \| -384 \| -97.0% \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-02 10:39:48 -07:00
assistant-librarian[bot]	481ca26b93	[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702) Cleanup and refactoring related to tile loading (#4294) ## Proposed changes Cleanup and refactoring done while implementing mixed precision for fp16/bf16 x fp8 Key changes: - Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and refactored the API to use consistent naming conventions - Updated load_tile_transpose functions to use output parameters instead of return values for consistency - Removed unused variable declarations and simplified type deduction logic - Define load_tile_with_elementwise to use tuple types explicitly for clarity ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3505](https://github.com/ROCm/composable_kernel/pull/3505) 🧑‍💻 Originally authored by @SamiAario-AMD --------- Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-02 12:20:55 +00:00
Bartłomiej Kocot	2e92a9ab11	[rocm-libraries] ROCm/rocm-libraries#4963 (commit cb6bbf6) [CK][CK Tile] Fix batched gemm kernel 2 lds (#4963) ## Motivation Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Technical Details Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Test Plan CI overall ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-27 23:16:19 +01:00
Thrupti Raj Lakshmana Gowda	720d7fa02b	[rocm-libraries] ROCm/rocm-libraries#4592 (commit 45f76cb) Tile Engine support for gfx950 (#4592) ## Motivation This PR adds support for the gfx950 GPU architecture to the Tile Engine in Composable Kernel library, focusing on GEMM operations with FP8 and BF8 data types. ## Technical Details Added gfx950-specific MFMA warp GEMM implementations with conditional compilation. Updated default GEMM configuration parameters for tile sizes and warp configurations. Added Jenkins CI pipeline stage for testing TILE_ENGINE_GEMM on gfx950 hardware. ## Test Plan Tile engine itself is a benchmarking utility, so if it passes the CI it will be tested automatically. ## Test Result Tile engine itself is a benchmarking utility, so if it passes the CI it will be tested automatically. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Thrupti Raj Lakshmana Gowda<ThruptiRaj.LakshmanaGowda@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-26 10:14:40 -06:00
Bartłomiej Kocot	d244aaa1c0	[rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6) [CK][CK TILE] Improve oob check (#4791) ## Motivation Improve OOB checks. Remove permutes which have been generated by thread buffer zero clear. at now in assembly there is only condmask instead of permute + condmask. Change number of KPack for generated instances ## Technical Details Remove permute instructions from assembly ## Test Plan test_grouped_convnd_fwd_tile ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com>	2026-02-24 22:40:48 +01:00
assistant-librarian[bot]	c28d5c3f74	[rocm-libraries] ROCm/rocm-libraries#4295 (commit fa2cfc8) [CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to allow caller to pass desciptors directly (#4295) ## Proposed changes Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create tensor views from strides and sizes. This refactors the descriptor creation out and add overloaded definitions, allowing descriptors to be created separately by the caller instead of passing explicit strides, with no functional changes. This will enable further refactoring of `RunGemm` to do likewise, enabling derived kernels like BatchedContractionKernel to avoid creating separate versions (PR [#3457](https://github.com/ROCm/composable_kernel/pull/3457)). ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion Since the logic within the MakeXBlockviews chains together operations on tuples, and thus the descriptors are also passed as such, adding a template parameter for the type of the input tuple was the simplest option to enable the overload without too much verbiage. However, for `MakeCBlockView` this adds a complications as the templated definitions are prone to overlap. This for now is avoided by just moving the arguments around for the descriptor version, which avoids the collision. It's not a great solution, so feel free to suggest a better one. --- 🔁 Imported from [ROCm/composable_kernel#3467](https://github.com/ROCm/composable_kernel/pull/3467) 🧑‍💻 Originally authored by @amd-meskelin --------- Co-authored-by: Matti Eskelinen <matti.eskelinen@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-24 12:43:38 -08:00
assistant-librarian[bot]	023ba6848e	[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e) [CK_TILE] Extend support of mix precision microscaling BQuant (#4267) ## Proposed changes Supported types combinations using BQuant=e8m0: - A=bf16 - B=bf16,bf8,fp4 Summary: - remove usage of `pk_fp4_raw_t`: consistent with other implementations and avoid taking into account of the packed size explicitly. In general, the raw type should not be used because CK Tile internally takes care of the PackedSize, so using the raw type adds unnecessary complexity to the implementation - handle microscaling by checking for `e8m0` type for BQuant (previous implementation was inconsistent) - add support for scaling instructions in `DequantPack8` - mx pipeline: - extend existing pipeline to support different B types - add support to scale and cast before writing to LDS or after reading from LDS (this can be defined in the `Problem` by the user) - block gemm: - mx pipeline is now using block gemm BQuant - block gemm BQuant can now load from LDS and apply scale and then call block gemm universal operator. This adds new functionalities and remove code duplication - warp gemm: - add case to support 128bit ds_read/write for both A and B when A=16bit and B=8bit - add examples and tests: note that some tests for bf16/fp4 already existed but were removed during previous tests refactoring. I added them again and other relevant tests for new types combinations ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689) 🧑‍💻 Originally authored by @EnricoDeg --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-24 09:55:50 -08:00
Emily Martins	40552f0ec6	[rocm-libraries] ROCm/rocm-libraries#4756 (commit 79bc2ca) [CK_TILE] Update Stream-K Reduction Strategy Enum (#4756) ## Motivation Currently, Stream-K has 3 reduction options: 1) atomics, 2) The reduction described in the Stream-K paper, and 3) a tree reduction. The reduction strategy described in the original Stream-K paper has the starting workgroup of each tile sequentially accumulating partial results of other contributing workgroups in the tile, which requires a linear number of steps. Hence, for clarity, this works updates the naming of the `StreamKReductionStrategy` enum members to better describe the existing reduction strategy options. ## Technical Details Prior to this change, the enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Reduction = 1u, TreeReduction = 2u }; ``` But, the distinction between `Reduction` and `TreeReduction` is not very clear and has some redundancy. Hence, the updated enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Linear = 1u, Tree = 2u }; ``` All references to `StreamKReductionStrategy` were updated to reflect this change. ## Test Plan No new functionality was added, so no new tests were added; I just validated existing tests and examples. ## Test Result All tests passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-24 06:40:08 +00:00

1 2 3 4 5 ...

266 Commits