composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-15 10:37:44 +00:00

Author	SHA1	Message	Date
Illia Silin	cfb09d76a5	[CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550 ) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
Chinmay Dattanand Kuchinad	820ed2dbb3	[CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776 ) ## Motivation Fix pivot mismatch in the persistent GEMM kernel's async input scheduler that causes GPU hangs and incorrect results when used with AsyncTP (Asynchronous Tensor Parallelism) on ROCm. PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM kernel with chunk signals to overlap communication and computation. The pivot mechanism ensures each rank starts computing from its own local shard first (which is already available), then moves to remote chunks as they arrive over the network. Because of the pivot mismatch, the kernel frequently waits on signals for chunks that have not yet arrived, while attempting to read data from completely different chunks. This synchronization desync reliably triggers infinite hangs during multi-GPU native AsyncTP execution. This fix is required to enable functional AsyncTP support on ROCm. ## Technical Details In the persistent kernel loop (`UniversalGemmKernel::operator()`), the M-tile coordinate used for data selection (`i_m`) and the M-tile coordinate used for the chunk-signal wait (`chunk_idx`) were derived from inconsistent bases: * `i_m` was computed from the unpivoted tile index `iM`. * `chunk_idx` was computed from the pivoted expression `(iM + tile_idx_pivot)`. This means the kernel could wait for chunk N's signal but then read from chunk M's memory, or vice versa. The mismatch scales with GPU count: with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc. The Fix: Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m` and `chunk_idx` from it. This guarantees the kernel always waits for the correct chunk before reading its data. (Note: Minor cosmetic `clang-format` changes were also pulled in alongside the fix). ## Test Plan 1. Build PyTorch with this CK change. 2. Run the specific multi-GPU AsyncTP native test: `timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest test/distributed/test_symmetric_memory.py -k test_fused_all_gather_matmul_native -q -s -x` ## Test Result Tests verify correct overlapping execution without hangs or accuracy mismatches when running the AsyncTP native path with non-zero pivots. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-04-01 09:21:20 -07:00
Bartłomiej Kocot	f14ee90152	[CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842 ) ## Motivation Force padding for atomic_add bf16 C tensor to avoid memfaults. ## Technical Details - add global atomic add for bf16 and enable them - add padding for atomic add bf16 due to the lack of oob - remove padding for not continous dims in conv for other cases - minor bwd data conv fixes ## Test Plan test_grouped_conv_*_tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 10:02:24 +02:00
Johannes Graner	c60514f371	[CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393 ) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 10:17:10 +01:00
joyeamd	c5202aada0	[CK][CK_TILE] Revert addional oob check in gemm IsSupported function (#5789 ) ## Motivation fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-26 09:40:44 +08:00
joyeamd	393ce1d23f	Revert "Ck/joye/revert oob check (#5640 )" (#5697 ) This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 22:04:14 +00:00
Bartłomiej Kocot	ce4525e82b	[CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555 ) ## Motivation Fix kbatch check in grouped conv and gemm kernels, allow tails for kbatch. ## Technical Details Round up K / Kperxdl and divide it by Kbatch to allow tail for K. ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-21 23:55:24 +01:00
joyeamd	d2b129f56b	Ck/joye/revert oob check (#5640 ) ## Motivation fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 12:30:08 +00:00
arai713	121fa1a503	[CK_TILE] Rename Stream-K grid function (#4795 ) ## Motivation This PR introduces a change in the name of the get_grid function in the Stream-K TilePartitioner to avoid confusion with a similarly named method. In the Stream-K TilePartitioner, there is get_grid() which returns num_cu*occupancy and there is grid_size() which returns the grid size used to launch the kernel. In this PR, we change get_grid() to be get_max_active_wgs() to better reflect what the function returns and not confuse it with grid_size(). ## Technical Details Initially in the Stream-K TilePartitioner we had get_grid() which returned grid_. We are renaming get_grid() to get_max_active_wgs() and grid_ to max_active_wgs_ internally, while keeping grid_size() the same. The parameter, grid, for the Stream-K TilePartitioner remains the same to maintain consistency with the rest of the Stream-K API. ## Test Plan Validated using the test suite that is already present. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 03:27:44 -06:00
Adam Osewski	7a12ecc762	[CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050 ) ## Motivation It's common in MoE workloads that some experts receive zero tokens, which would result in some of the dimensions equal to zero. Currently we handle such case only for non-persistent kernels where we have all GEMMs information beforehand on host - we validate this during creation of kernel arguments. However for the "dynamic" input path (persistent kernel) this information is not available before kernel launch. Thus we have to validate this during kernel execution. The goal is to add this validation. ## Technical Details Skip work if any of Grouped GEMM groups M/N/K are zero for persistent kernel path. ## Test Plan Add unit-tests which cover "dynamic" inputs with zero dims for persistent kernel execution path. ## Test Result All tests pass. ## Submission Checklist - [ x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 13:28:24 +00:00
Emily Martins	e7f9244c20	[CK_TILE] Reduce Register Spills in Stream-K Reductions (#4984 ) ## Motivation In CK Tile Stream-K, kernels using one of two non-atomic reduction strategies (i.e., linear, tree) have high register spill count, with the tree reduction generally being worse. These changes act a first step to help decrease the register spill count. ## Technical Details ### Problem 1: Unvectorized access to partials In both the linear and tree reductions, workgroups write partials results to a global buffer; another workgroup will later read this data. When the initial logic to support reading and writing to the partials buffer was added (see https://github.com/ROCm/composable_kernel/pull/3107), the tile distribution encoding used to read from and write to partials matches the register layout for the accumulator of the mfma instruction used for the kernel. Since we do not currently use the transposed register layout for the accumulator, we end with an encoding that is not optimized for writing to HBM. For example: Consider the register layout of the `v_mfma_f32_16x16x32_fp8_fp8` instruction. ```bash ./matrix_calculator.py --architecture gfx942 --instruction v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix ``` <img width="1113" height="537" alt="image" src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02" /> The above shows that threads are responsible for consecutive elements down a column of the C tile. If we use this distribution to read and write to partials with C in row major, then threads are unable to perform vectorized reads and writes. Note: thread 0 is shown in red and thread 1 is shown in green. Since the C-shuffle Epilogue only supports C in row major, reading and writing to partials is highly unoptimized. ### Problem 2: Missed opportunity for SPGR use in tree reduction loop Since the reduction occurs between workgroups, all threads in the workgroup follow the same execution paths in the tree reduction logic, hence various variables should be using SGPRs, but they are not. ### Implemented Solutions 1. Add a new tile distribution encoding that is optimized for accessing partials in HBM. This encoding does not change the data assignment to threads, it merely changes the addresses to which they write/read in the partials buffer. For example, continuing with the `v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result in threads writing in the following layout: <img width="517" height="342" alt="image" src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202" /> This layout ensures that each thread writes along a row, enabling `buffer_{store\|load}_dwordx4` instructions (i.e., vectorized accesses). This helps reduce register usage due to requiring fewer offset calculations. 2. To force SGPR usage in the tree reduction loop, I make use of CK Tile's `amd_wave_read_first_lane` which is a wrapper around `__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the tree reduction. _These changes do not fully eliminate register spills. Future work will aim to further reduce spills. But these changes make good progress._ ## Test Plan Added tests for different warp tile sizes to validate that the new encoding works with different `WarpGemm` variants. ## Test Result All tests pass locally on all gfx9 architectures. Some results for decreases in register spills on gfx942: (BL = baseline) \| Kernel \| SGPR Spill (BL) \| SGPR Spill (new) \| SGPR Delta \| SGPR % \| VGPR Spill (BL) \| VGPR Spill (new) \| VGPR Delta \| VGPR % \| \|--------\|------------------:\|------------------:\|-----------:\|-------:\|-------------------:\|------------------:\|-----------:\|-------:\| \| fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 \| 223 \| 0 \| -223 \| -100.0% \| 21 \| 20 \| -1 \| -4.8% \| \| fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 \| 233 \| 11 \| -222 \| -95.3% \| 443 \| 23 \| -420 \| -94.8% \| \| fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 \| 221 \| 3 \| -218 \| -98.6% \| 12 \| 6 \| -6 \| -50.0% \| \| fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 \| 230 \| 14 \| -216 \| -93.9% \| 396 \| 12 \| -384 \| -97.0% \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-02 10:39:48 -07:00
Bartłomiej Kocot	b16838eb6e	[CK][CK Tile] Fix batched gemm kernel 2 lds (#4963 ) ## Motivation Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Technical Details Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Test Plan CI overall ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-27 23:16:19 +01:00
assistant-librarian[bot]	6068389ba1	[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to allow caller to pass desciptors directly (#4295 ) ## Proposed changes Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create tensor views from strides and sizes. This refactors the descriptor creation out and add overloaded definitions, allowing descriptors to be created separately by the caller instead of passing explicit strides, with no functional changes. This will enable further refactoring of `RunGemm` to do likewise, enabling derived kernels like BatchedContractionKernel to avoid creating separate versions (PR [#3457](https://github.com/ROCm/composable_kernel/pull/3457)). ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion Since the logic within the MakeXBlockviews chains together operations on tuples, and thus the descriptors are also passed as such, adding a template parameter for the type of the input tuple was the simplest option to enable the overload without too much verbiage. However, for `MakeCBlockView` this adds a complications as the templated definitions are prone to overlap. This for now is avoided by just moving the arguments around for the descriptor version, which avoids the collision. It's not a great solution, so feel free to suggest a better one. --- 🔁 Imported from [ROCm/composable_kernel#3467](https://github.com/ROCm/composable_kernel/pull/3467) 🧑‍💻 Originally authored by @amd-meskelin --------- Co-authored-by: Matti Eskelinen <matti.eskelinen@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-24 12:43:38 -08:00
Emily Martins	cf00dc87d0	[CK_TILE] Update Stream-K Reduction Strategy Enum (#4756 ) ## Motivation Currently, Stream-K has 3 reduction options: 1) atomics, 2) The reduction described in the Stream-K paper, and 3) a tree reduction. The reduction strategy described in the original Stream-K paper has the starting workgroup of each tile sequentially accumulating partial results of other contributing workgroups in the tile, which requires a linear number of steps. Hence, for clarity, this works updates the naming of the `StreamKReductionStrategy` enum members to better describe the existing reduction strategy options. ## Technical Details Prior to this change, the enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Reduction = 1u, TreeReduction = 2u }; ``` But, the distinction between `Reduction` and `TreeReduction` is not very clear and has some redundancy. Hence, the updated enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Linear = 1u, Tree = 2u }; ``` All references to `StreamKReductionStrategy` were updated to reflect this change. ## Test Plan No new functionality was added, so no new tests were added; I just validated existing tests and examples. ## Test Result All tests passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-24 06:40:08 +00:00
assistant-librarian[bot]	72871f5276	moe flatmm xcd remap (#4297 ) co-authors: @Chi-Chu319 @juuso-oskari Added XCD remapping for flatmm moe <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta name=ProgId content=Excel.Sheet> <meta name=Generator content="Microsoft Excel 15"> <link id=Main-File rel=Main-File href="file:///C:/Users/tianxiwu/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> <link rel=File-List href="file:///C:/Users/tianxiwu/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> <style> <!--table {mso-displayed-decimal-separator:"\."; mso-displayed-thousand-separator:"\,";} @page {margin:.75in .7in .75in .7in; mso-header-margin:.3in; mso-footer-margin:.3in;} tr {mso-height-source:auto;} col {mso-width-source:auto;} br {mso-data-placement:same-cell;} td {padding-top:1px; padding-right:1px; padding-left:1px; mso-ignore:padding; color:black; font-size:11.0pt; font-weight:400; font-style:normal; text-decoration:none; font-family:Arial, sans-serif; mso-font-charset:0; mso-number-format:General; text-align:general; vertical-align:bottom; border:none; mso-background-source:auto; mso-pattern:auto; mso-protection:locked visible; white-space:nowrap; mso-rotate:0;} --> </style> </head> <body link="#467886" vlink="#96607D"> batch \| Mixtral (tflops, wip_355) \| Mixtral-7B (tflops, our branch) \| perf boost -- \| -- \| -- \| -- 64 \| 865.424 \| 995.455 \| 15.0% 256 \| 886.336 \| 1020.96 \| 15.2% 1024 \| 890.808 \| 1022.53 \| 14.8% </body> </html> --- 🔁 Imported from [ROCm/composable_kernel#3161](https://github.com/ROCm/composable_kernel/pull/3161) 🧑‍💻 Originally authored by @Chi-Chu319 --------- Co-authored-by: Tianxing Wu <chi0chu319@gmail.com> Co-authored-by: Tianxing Wu <tianxing.wu@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-18 11:32:15 -08:00
Emily Martins	b6f1e99074	[CK_TILE] Fix alignment in Stream-K workspace buffer (#3625 ) * Fix alignment issue in Stream-K workspace buffer In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests. * Compute reference GEMM on GPU for test verification to decrease testing time [ROCm/composable_kernel commit: `f5c2f09036`]	2026-01-23 16:14:22 -07:00
Max Podkorytov	8b842250da	Add persistent async input scheduler for GEMM kernels (#3520 ) Add signal-based synchronization for persistent GEMM kernels where input data becomes available incrementally. Uses modulo wraparound (like PyTorch's AsyncMM) for chunk index calculation: chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks Key components: - PersistentAsyncInputScheduler struct with tiles_per_chunk_m, chunk_signals, tile_idx_pivot_m, and num_chunks fields - wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency - IsSupportedArgument validation for scheduler parameters - Example demonstrating async input scheduling with simulated producer - GTest unit tests covering all layout combinations [ROCm/composable_kernel commit: `91b4102a59`]	2026-01-20 10:37:09 -08:00
Aviral Goel	3096269434	refactor: remove Default scheduler implementation as it not used anymore (#3542 ) * refactor: remove Default scheduler implementation as it not used anymore * refactor: remove dead code from gemm universal kernel * chore: add descriptive comments about amd intrinsic hardware sync instructions * fix: label existing memory pipeline for aquant as intrawave [ROCm/composable_kernel commit: `e809861d49`]	2026-01-12 09:51:06 -08:00
joyeamd	00d05ab32e	Merge some updates for ck_tile headers (#3342 ) * fix some issues from internal branch * update cshuffle_epilogue * update cshuffle_epilogue * update cshuffle * update warp_gemm [ROCm/composable_kernel commit: `b78563b3d3`]	2026-01-05 23:39:00 -08:00
joyeamd	3b5f2b2d99	Joye/revise wp pipeline (#3493 ) * [CK_TILE] unify double and single lds implementation (#108) Unify LDS buffer management API for single and double buffering modes This change consolidates the Local Data Store (LDS) buffer management by: Merging single and double LDS buffer APIs into a unified interface Implementing ping-pong address calculation in pipeline when double LDS is enabled Computing pong buffer addresses dynamically using base address offsets --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update wp_pipeline * fix a c++17 issue * update for ci errors * fix ci issues * include a header to fix ci errors * fix some rebase issues * update with rebase --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `2b563ad048`]	2026-01-05 13:49:26 -08:00
Max Podkorytov	ece5bd6435	[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 ) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder * hacky poc * fix tile engine * kill the lambda * maybe fix test build * more fixes * clang-format * save temp * clang-format * mostly fix examples * clang-format * remove dead code * more cleanup * fix fmha bwd build (default epilogue set/add appears to be broken) * fix default epilogue tests but not correctness * clang-format * fix bquant * clang-format * cleanup dead code * rearrange make windows for readability * restore changes to IsSupportedArgument * fix smoke-builder * clang-format * fixup rename class * build fixes * clang-format * fix builder * fixup * remove set from builder tests * fix test * clang-format * re-refactor the kernels * clang-format * fix header license * remove memory operation from conv bwd test * clang-format * clang-format example,include * clang-format test * build fixes * clang-format * solve compilation error * fix the CI * solve compilation error * clang format * solve merge conflict * solve merge conflict * solve the gfx11 error * solve test error * moar build fixes * remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `e339101e9c`]	2026-01-04 03:28:14 -08:00
jakpiase	07b16d48e4	[CK_TILE] Minor splitk bugfix for gemms and conv (#3387 ) * fix for splitk if splitk < grid * add different splitk implementation * minor bugfix for streamk gemm * Add test --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `c0797c1671`]	2025-12-24 00:10:13 +01:00
Bartłomiej Kocot	a45c051ac9	[CK TILE][AICK-439] Fix cshuffle epilogue wave per shuffle (#3364 ) * [CK TILE] Fix cshufle epligoue wave per shuffle * Align shuffle per tile with smem * fixes * Fixes for double smem * fix [ROCm/composable_kernel commit: `3b773109e5`]	2025-12-15 12:59:48 +01:00
Emily Martins	eeb78c46a4	[CK_TILE] Stream-K Tree Reduction and Cache Skipping Integration (#3371 ) * CK Tile Stream-K Tree Reduction This change adds the first implementation of the Stream-K tree reduction strategy into CK Tile. The tree reduction reduces the the number of steps for accumulating results for a tile from O(N) to O(logN) where N is the number of workgroups contributing to a C tile. Additionally, in the original non-atomic reduction strategy, atomics were used to set the flags buffer and to read from the flags buffer. Howeover, through investigation with the tree reduciton, atomics with default (relaxed) semantics were not enough to guarantee workgroups would not read stale data, leading to incorrect results. Stronger acquire/release memory orderings are too expensive. So, this change also eliminates the use of atomics for setting the flags. Instead, we leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby avoiding the use of atomics. Prelimiary tests were also added for the normal reduction and tree reduction. More will be added in a future PR via tile engine. * Move Stream-K kernel files to a subdirectory * Cleanup Code Style & Handle Unsupported Reductions This change makes the following small changes: - Add an explicit else block for unimplemented reduction strategies - Clarify type of sk_flags_ptr via auto* - Add description for extra_iters_before_me variable * Run new copyright script on new files [ROCm/composable_kernel commit: `22b945e06e`]	2025-12-14 14:49:49 -07:00
linqunAMD	245c274287	[CK_TILE] Port hw independent changes from internal repo to develop branch (#3301 ) * [CK_TILE] Port hw independent changes from internal repo to develop branch It includes PR#96, #114, #120, #121. * correct rebase error [ROCm/composable_kernel commit: `fc7bf0ab1c`]	2025-12-12 09:28:37 -08:00
Cong Ma	fa1c7bc6ba	Tile engine for streamk (#3157 ) * [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM. - This commit lays the groundwork for integrating the tile engine into streamk GEMM. It focuses on creating benchmark executables for streamk GEMM. - Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once the streamk implementation reaches stability. * [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM * [CK TILE STREAMK] Refactor: Extract common utility functions. * [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation * Add pre-commit * [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK * [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK * [CK TILE STREAMK] Update Jenkinsfile * [CK TILE Engine] Update StreamK tile engine help message Remove default value messages as they are automatically printed * [CK TILE Engine] Update StreamK tile engine - Remove namespace reboot * [CK TILE Engine] Update StreamK tile engine - Fix merge error [ROCm/composable_kernel commit: `30727c48fc`]	2025-11-27 15:49:57 -07:00
Aviral Goel	216c23b945	chore(copyright): update copyright header for include directory (#3293 ) [ROCm/composable_kernel commit: `de6466481f`]	2025-11-26 11:00:05 -07:00
Bartłomiej Kocot	2c2672ff0e	[CK TILE] Grouped Conv Explicit Gemm (#3289 ) * [CK TILE] Grouped Conv Explicit Gemm * fixes * apply builder fixes [ROCm/composable_kernel commit: `00dfa2f2ce`]	2025-11-25 23:28:35 +01:00
Emily Martins	0d6a0a3c2f	Fix CK Tile DP + 2 Tile Stream-K Validation Errors (#3269 ) When there are multiple workgroups contributing to a tile, when using atomics, there may be round off error in cases where the accumulator type is not the same as the C type. To compute an error tolerance for test validation, the Stream-K Tile Partitioner has a function called estimate_num_wgs_per_tile to estimate the number of workgroups per tile. That said, this function only provides an estimate. In some cases for DP+2TSK, the function returns 1 rather than the more accurate value of 2. Thus, this change updates the estimate_num_wgs_per_tile function to explicitely return the value of 2 in cases for DP+2TSK to ensure that we have a better error tolerance to avoid test failures due to round-off error. [ROCm/composable_kernel commit: `02ab76c2cb`]	2025-11-21 20:29:47 -07:00
Emily Martins	4aa8d64c9a	[CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202 ) * Remove old CK Tile Stream-K implementation The original CK Stream-K implementation was based on old CK's Stream-K block to C tile map. However, this implementation did not align with the original Stream-K paper. Thus, we implemented a new tile partitioner and associated Stream-K kernel, which was placed in the reboot namespace. Now that the new Stream-K implementation is ready, this change removes all artifacts of the old implementation. Specifically, the following changes were made: - Removes old Stream-K tile partitioner from CK Tile - Removes the reboot namespace such that the new implementation resides in the ck_tile namespace only. - Adds tests for bf8 and fp8 using the new implementation - Removes tests for the old implementation - Remove the v2 suffix from the new CK Tile Tile Partitioner derived classes. - Updates Stream-K Kernel ops file to use /** commenting style. * Remove v2 from tile partitioner validation function names [ROCm/composable_kernel commit: `2e4b8a8fc4`]	2025-11-20 09:32:32 -07:00
linqunAMD	ac0fb4fec5	[ck_tile] enable test grouped_gemm_quant and gemm_streamk on gfx12 (#3196 ) 1. Enable grouped_gemm_quant and gemm_streamk on gfx12 - test_ck_tile_streamk_smoke is kept on gfx9, since it looks someone is still working on it. 2. Update warp tile size in grouped_gemm_quant and gemm_streamk unit test 3. Reduce gemm tile size to pass the build on gfx12 in test_gemm_streamk_reboot_types.hpp [ROCm/composable_kernel commit: `d2e32b4305`]	2025-11-20 08:40:27 +08:00
Max Podkorytov	3774b900d1	[CK-Tile] Remove usage of tile partitioner's full gemm shape (#3204 ) gemm shape should be used from the pipeline instead (where it gets from a problem description struct) [ROCm/composable_kernel commit: `a3a4eb12bd`]	2025-11-18 09:56:40 -08:00
Bartłomiej Kocot	a2a69e7649	[CK_BUILDER] Add grouped conv fwd ck tile traits (#3183 ) * [CK BUILDER] Add grouped conv fwd ck tile traits * Update instance_traits_tile_grouped_convolution_forward.hpp * Update grouped_convolution_forward_kernel.hpp [ROCm/composable_kernel commit: `92c1f4981a`]	2025-11-11 13:55:33 -08:00
Cong Ma	0343c4e1fe	Introduces the new partitioner to implement the reduction StreamK kernel. (#3107 ) * Introduces the new partitioner to implement the reduction StreamK kernel * Add more doc text to functions * Add persistent-dp option to streamk example * Update example/ck_tile/40_streamk_gemm/README.md [ROCm/composable_kernel commit: `5abe4109e0`]	2025-11-04 10:32:17 -07:00
Mateusz Ozga	8eb813de42	[CK_TILE] Fixed multi-abd GEMM test, NaN problem (#2979 ) * Multi-ABD NaN problem * Rollback tests --------- Co-authored-by: root <root@splinter-126-008d.aus.dcgpu> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `da4247a6df`]	2025-10-28 15:53:36 +01:00
arai713	d06d23ab11	[CK_TILE] Stream-K operator() Reboot (#3064 ) * Persistent Stream-K Kernel Implementation This change implements an operator() function in the reboot::StreamKKernel class that is enabled when the Persistent flag is set to true. In this case, the data-parallel portion and the Stream-K portion of the kernel are fully persistent. The changes were made in the reboot namespace. A future PR will remove the old Stream-K kernel class and remove the reboot namespace. * Unit Tests for Persistent Stream-K Kernel This change contains the inital test suite for the Persitent Stream-K Kernel. The files contain "reboot" in the name; a future PR will remove tests for the old Stream-K Kernel and remove the "reboot" naming. A future commit will add tests for the non-persistent kernel. Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase class. This allows us to estimate the number of accumulations done per macro tile in C to use during validation when computing relative and absolute tolerance. * Adding implementation for the Non-Persistent Stream-K kernel This code is adding the operator() function for the Non-Persistent Stream-K kernel. Persistency of the kernel is determined through a template argument. The Non-Persistent kernel will allocate additional workgroups for the data parallel section, leading to a different structure for processing the data parallel and Stream-K sections. There has been an addition to the TilePartitioner to get access to the whether Persistent has been set to true or false in the StreamKKernel. * Adding in the tests for the Non-Persistent Stream-K kernel * Refactor Stream-K Reboot Unit Tests This commit makes the following changes: - Update test cases to determine M, N, and K based on the number of CUs. This ensures that each test case is one of Edge Case, SK Only, DP Only, or DP + 2 Tile SK regardless of the architecture. - Since the DP + 2 Tile SK test case takes long to run, this change moves this case into a separate .inc file and labels it as an extended test. - Since the extended test takes > 30 seconds to run, this test is added to the list of regression tests. * Fix spelling errors in comments for test cases Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Changes based on review Removed const volatile for typenames Set up alias for is_tuple_t Naming changes for clarity: GemmCommon -> BaseGemm Moved std::enable_if_t out of template parameters and changed to a return type for operator() Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance --------- Co-authored-by: Emily Martins <emily.martins@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `054fdb765c`]	2025-10-27 09:14:17 -07:00
Emily Martins	36020b389c	Style updates and cleanup The following changes were made - Renamed iter to iter_start - Renamed tile_iter to tile_iter_start - Moved documentation from member variables to getters - Removed double underscore from extra_iters_before_me variable - Defined parent header in impl file - Removed unused inlcudes [ROCm/composable_kernel commit: `cb83d52301`]	2025-10-16 08:47:06 -06:00
Astha	1d1f8af58b	Addition of the derived structs for the new Stream-K TilePartitioner There are 2 derived structs based on whether Stream-K is persistent or not. If it's persistent that means that both the data parallel and Stream-K sections are data parallel. If it's non-persistent that means that only the Stream-K section is persistent, while the data parallel section will have separate workgroups allocated for it. Both structs will have a template argument for Persistent. The 2 derived classes will inherit common variables and functions from the Stream-K TilePartitioner base class. There are additional variables for the differing data parallel sections that will be added to each derived class, that are in charge of the indexing/bookkeeping for the data parallel sections. The only additional function that will differ between the 2 structs is GridSize(), as the non-persistent will allocate extra workgroups for data parallel. Unit tests for the derived structs are included. [ROCm/composable_kernel commit: `8f75d7cea6`]	2025-10-16 08:47:06 -06:00
Emily Martins	64e6fef4ba	Stream-K Tile Partitioner Base Class with Tests To better align with the original Stream-K paper, this change implements a new Stream-K tile partitioner base class. This class will handle the Stream-K setup that is common to both a persistent and non-persistent DP section. A later change will implement derived classes to handle the differences between persistent and non-persistent DP. This change also includes unit tests for the base tile partitioner. [ROCm/composable_kernel commit: `f87f768d16`]	2025-10-16 08:47:06 -06:00
aledudek	f1c8acbd71	[CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860 ) * Add valid check batched gemm part1 * [CK_TILE] Add batched gemm kernel IsSupported func checks * revert broken pre-commit hook changes * revert broken pre-commit hook changes v2 * Clarify error messages [ROCm/composable_kernel commit: `3021604213`]	2025-10-13 13:55:23 +02:00
Christopher Millette	31f0642364	Streamk functional tests (#2974 ) * Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite * Account for stride when computing K offsets for A and B tensor This change ensures that the correct stride is used when computing the K offsets into the A and B tensors in the Stream-K Kernel's operator() function. This ensures that the kernel executes correct regardless of whether A and B are row or column major. * Move helper code to test_gemm_streamk_util.hpp * Separate tests into smoke/regression/extended. Add bf16 datatype * Run clang-format * Refactor combinatorial macro expansion and naming * Adjust the initialization values to account for better tolerance on bf16 * Correct BF16 datatypes in comments * Move the extended tests under the REGRESSION_TESTS label * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Emily Martins <emily.martins@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f5708882a3`]	2025-10-11 07:53:40 -05:00
Thomas Ning	192536597e	add the sync barrier for persistent kernel (#2977 ) [ROCm/composable_kernel commit: `ae9f29b7d5`]	2025-10-07 11:54:04 -07:00
Aviral Goel	a69e4ed8b7	Extend Grouped GEMM with MultiD (Single & Double Shared Memory) feature to use persistent kernel option (#2933 ) * feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature * refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel * tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments * fix: segfault fix by passing correct parameters for d tensors * style: clang format * WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults * feat(grouped_gemm_multi_d): add functionality to run persistant kernel * feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature * refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel * tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments * fix: segfault fix by passing correct parameters for d tensors * style: clang format * fix: incorrect validation method and Dtensor layout in test suite * docs: improved README text based on review comments * fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint [ROCm/composable_kernel commit: `bebf0e9d15`]	2025-09-29 15:03:56 -07:00
Aviral Goel	e98edd3322	Integrate Multi D GEMMs into Grouped GEMMs along with unit tests (#2923 ) * feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature * feat: generalized grouped_gemm_kernel.hpp * feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0 * refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel * tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments * fix: segfault fix by passing correct parameters for d tensors * docs: add multi d info and trim down outdated content * tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility * style: clang format * fix: incorrect validation method and Dtensor layout in test suite [ROCm/composable_kernel commit: `a44bea45b2`]	2025-09-26 09:59:58 -07:00
Khushbu Agarwal	9ed178a93e	Fix for Add the API to load SGPR (#2913 ) * Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)" This reverts commit 4c78cc31c5b8e0c9db09c24fa35393f603a8a47f. * Fix: sgpr minor issue * cyclic dependency resolved * clang formatted * removing unused variable * clang formatted --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b56e5d1d79`]	2025-09-25 10:32:42 -07:00
Thomas Ning	bdea637a15	Fix the gfx950 numerical errors (#2911 ) * Update grouped_gemm example and pipeline * find the root cause error in did not enable the transpose in gfx950 correctly * Fix v3 pipeline, row and col major * Disable f8 datatype tests, it fails on gfx950 * fix the abd test by clear the runtime argument unsupported --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com> [ROCm/composable_kernel commit: `b159841a06`]	2025-09-23 22:54:52 -07:00
asleepzzz	651a5dd0b9	Revert "[CK-Tile] Add the API to load SGPR (#2878 )" (#2904 ) This reverts commit 3e008a2d22ad1ba8a9b2c7eca369a8593b7d6e95. [ROCm/composable_kernel commit: `f161b5b738`]	2025-09-23 14:33:51 -07:00
Thomas Ning	e3702467d5	[CK-Tile] Add the API to load SGPR (#2878 ) * Have a workable version for SGPR * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * substitute with the new sgpr read api * update the CHANGELOG * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * change to static for logic * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. [ROCm/composable_kernel commit: `2cbbf5dcb3`]	2025-09-23 01:23:56 -07:00
Mateusz Ozga	2a150508c8	[CK_TILE] Multiple-ABD GEMM example (#2788 ) * Multi ABD - initial commit * Clang-foramt fix * block gemm, unify the name of CDataType * Apply chnages to mem-pipeline * Rollback prefix for DType and Layout * Gemm Kernel Basic, rename * WMMA config * Grouped GEMM * Clang-format * Dropout, name * Review v2 * Move element_wise fn to unnary, remov old ones fn * clang-format * Fix issue review * WP operator adjust to universal gemm * v2 prepare * Remove unused comment * Remove vectorsize * Rollback * Adjust pipeline for abd * Shuffle argument * CI-fail fix quant * Fix ag_br pipeline * Failing tests * Typo * Single argument support [ROCm/composable_kernel commit: `30ab1d6a71`]	2025-09-19 01:14:11 +02:00
Aviral Goel	6fd8f800d8	fix(grouped_gemm): pipeline selection when tail_num varies per group and leads to numerical error (#2863 ) * fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num * WIP: add temp config to stress test numerical error correction * refactor: remove comments [ROCm/composable_kernel commit: `db79fad16f`]	2025-09-16 18:43:19 -07:00

1 2 3

103 Commits