composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-14 19:18:35 +00:00

Author	SHA1	Message	Date
Kiefer van Teutem	e8f9bb0c19	[CK_Tile] Refactor amdgcn_mma policy structs (#5272 ) ## Motivation The point of this MR is to update the intrinsic layout parameters to simplify them and make them more clear and flexible. Also, a number of simple refactors were performed to reduce boilerplate and code duplication. ## Technical Details In CK Tile and old CK, the full set of information available in the intrinsic wrappers, for WMMA and MFMA combined, would be something like: ``` // Basic info using ADataType = void; using BDataType = void; using CDataType = void; using AVecType = ext_vector_t<ADataType, 0>; using BVecType = ext_vector_t<BDataType, 0>; using CVecType = ext_vector_t<CDataType, 0>; // Fragment sizes static constexpr index_t kM; static constexpr index_t kN; static constexpr index_t kK; // Layout parameters static constexpr index_t kAMBlock; static constexpr index_t kBNBlock; static constexpr index_t kRepeat; static constexpr index_t kAMLane; static constexpr index_t kBNLane; static constexpr index_t kABK0PerLane; static constexpr index_t kABKLane; static constexpr index_t kABK1PerLane; static constexpr index_t kCMLane; static constexpr index_t kCNLane; static constexpr index_t kCM0PerLane; static constexpr index_t kCM1PerLane; using kABPs2RHssMajor = sequence<2, 1>; using kABPs2RHssMinor = sequence<1, 0>; using kABYs2RHsMajor = sequence<2, 2>; using kABYs2RHsMinor = sequence<0, 2>; using kCPs2RHssMajor = sequence<1, 2>; using kCPs2RHssMinor = sequence<1, 0>; using kCYs2RHsMajor = sequence<1, 1>; using kCYs2RHsMinor = sequence<0, 2>; using kCTPs2RHssMajor = sequence<2, 1>; using kCTPs2RHssMinor = sequence<1, 0>; using kCTYs2RHsMajor = sequence<2, 2>; using kCTYs2RHsMinor = sequence<0, 2>; ``` Note that on top of the intrinsic sizes, we have 12 layout parameters. 
I have reduced this in the new design to: ``` // Basic info using ADataType = void; using BDataType = void; using CDataType = void; // Fragment sizes static constexpr index_t kM; static constexpr index_t kN; static constexpr index_t kK; // Layout parameters static constexpr index_t kABKPerLane; // K2 * K0, Always the same, even for diff A / B layouts static constexpr index_t kAKNumAccess; // K2 static constexpr index_t kARepeat; // Used for RDNA3 repeated inputs and CDNA block hiding. static constexpr index_t kBKNumAccess; // K2 static constexpr index_t kBRepeat; // Used for RDNA3 repeated inputs and CDNA block hiding. static constexpr index_t kCMPerLane; // M2 * M0 static constexpr index_t kCMNumAccess; // M2 // Derived properties using AVecType = ext_vector_t<ADataType, 0>; using BVecType = ext_vector_t<BDataType, 0>; using CVecType = ext_vector_t<CDataType, 0>; ``` Note that there are now only 7 layout parameters and no more dimensionality orderings. These 7 parameters are more general than the original 12, and can handle intrinsic and mid-level features that are currently awkward in CK Tile, like dealing with AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK Tile), and future arch features. Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency. I added a detailed explanation of the new params in terms of register mappings at the top of amdgcn_mma.hpp. Other refactorings I did in this MR: - Make an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. Should also make auto-generating the amd_gcn specializations much easier. - Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code. - Remove overloaded "Block" term for intrinsic dimensions, and replace by "Frag" instead. 
Some spots were already using the term "Frag" for combined intrinsics, in which case I changed that term to "Chunk" instead. - Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately). - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 09:07:00 -06:00
chris-tsiaousis-hpc	9c1c4c9168	Changed the include order of the new WMMA/MFMA unification framework (#5241) These changes fix the include order and make header files independent of one another. The `remod.py` script was also run, which changed the `grouped_convolution.hpp` and `core.hpp` files. ## Motivation Some headers appear to depend on include order. For example, moving `#include "wmma/wmma.hpp"` later in the include list of [amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp) causes compilation errors. The pre-commit script `remod.py` also reorders includes alphabetically, which triggers the same compilation issues. Expected behaviour: headers should be independent of one another; no header should require another to be included first, and each header should compile correctly on its own. ## Test Plan The CI (that runs `remod.py`) should compile. ## Test Result Existing CI should compile and be green. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-03-12 09:26:58 +01:00
Anton Gorenko	25d9fdfc16	[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368 ) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 09:59:50 +00:00
Hosang	559ad6f0b1	[CK_TILE] Update gfx11 FMHA forward kernel configs (#5088 ) ## Motivation Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded seqlen_q/k) cases. This tuning is based on heuristic search and improves performance in most tested shapes. Performance should be evaluated on top of [`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018) (required baseline). ## Technical Details - Updated gfx11 codegen heuristic choices for tile size and occupancy. - Updated gfx11 pipeline selection: - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad cases are dispatched to the faster kernel path.` - Kept gfx12 unchanged: with PSSK support from [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957), existing gfx12 config is already sufficient. - Tuning rationale: - In some cases, higher `kBlockPerCu` lowers register pressure. - On RDNA, this generally aligns with better performance when `waves_per_eu >= 6`. 
## Test Plan
- test_ck_tile_fmha
- tile_example_fmha_fwd: tested this on gfx1100 and gfx1151
  `./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`

## Test Result
- TFLOPs by sequence length, target: `gfx1100`, layout: `bhsd`
  - mode: batch / VGPR usage: 225 vs 214

    | SeqLen | Baseline | Tuned | Gain |
    | -- | -- | -- | -- |
    | 1024 | 74.10 | 71.97 | 0.97x |
    | 4096 | 66.26 | 77.79 | 1.17x |
    | 8192 | 68.18 | 75.88 | 1.11x |
    | 12288 | 68.47 | 80.44 | 1.17x |
    | 16384 | 59.54 | 79.66 | 1.34x |
    | 20480 | 55.78 | 77.91 | 1.40x |
    | 24576 | 55.08 | 77.47 | 1.41x |
    | 27280 | 47.45 | 77.16 | 1.63x |

  - mode: group / VGPR usage: 256 vs 214

    | SeqLen | Baseline | Tuned | Gain |
    | -- | -- | -- | -- |
    | 1024 | 71.47 | 70.6 | 0.99x |
    | 4096 | 64.74 | 77.06 | 1.19x |
    | 8192 | 64.68 | 75.47 | 1.17x |
    | 12288 | 66.43 | 79.95 | 1.20x |
    | 16384 | 56.02 | 79.73 | 1.42x |
    | 20480 | 50.21 | 78.15 | 1.56x |
    | 24576 | 47.29 | 77.53 | 1.64x |
    | 27280 | 46.13 | 77.04 | 1.67x |

- TFLOPs by sequence length, target: `gfx1151`, layout: `bshd`
  - mode: batch / VGPR usage: 225 vs 223

    | Batch | Baseline | Tuned | Gain |
    | -- | -- | -- | -- |
    | 1024 | 26.85 | 29.17 | 1.09x |
    | 4096 | 24.75 | 26.01 | 1.05x |
    | 8192 | 25.24 | 25.50 | 1.01x |
    | 12288 | 25.18 | 25.00 | 0.99x |
    | 16384 | 24.79 | 25.91 | 1.05x |
    | 20480 | 25.56 | 25.24 | 0.99x |
    | 24576 | 25.13 | 26.20 | 1.04x |
    | 27280 | 10.78 | 26.35 | 2.44x |

  - mode: group / VGPR usage: 256 vs 229

    | Batch | Baseline | Tuned | Gain |
    | -- | -- | -- | -- |
    | 1024 | 27.44 | 26.71 | 0.97x |
    | 4096 | 21.89 | 23.09 | 1.05x |
    | 8192 | 22.85 | 24.49 | 1.07x |
    | 12288 | 24.33 | 24.42 | 1.00x |
    | 16384 | 20.05 | 24.98 | 1.24x |
    | 20480 | 14.70 | 25.15 | 1.71x |
    | 24576 | 11.30 | 26.31 | 2.33x |
    | 27280 | 10.10 | 26.32 | 2.61x |

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-10 09:46:41 -07:00
chris-tsiaousis-hpc	9d325ac47c	[CK TILE] Unification of sparse MFMA/WMMA policy structs (#4837 ) ## Motivation The existing unification work supports DENSE intrinsics. In this PR we enable support for SPARSE as well as SCALE intrinsics and add an example SPARSE implementation. ## Technical Details Mostly trivial changes. One framework change is that the desired `MmaOpFamily` is passed to the `MmaDefaultSelector`. As my relevant commit explains, we do not support a fallback family at the moment, but it is something we can consider. ## Test Plan Added a new test for the relevant sparse specializations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-03-05 19:52:04 +00:00
Kiefer van Teutem	2fcb94c18a	Add Tile Distribution Encoding Register Mapping debug utility for MFMA / WMMA unification work. (#4804 ) ## Motivation This PR adds a small utility that allows you to use Tile Distribution Encodings to directly map matrix elements to register locations and vice versa. It can also print forward and backward layout mappings similar to the Matrix Calculator utility. The utility is not meant for index calculations in actual kernels, but rather as a debugging tool and probably for automated verification of the policy structs in the new WMMA / MFMA unification design. ## Technical Details Tile Distribution Encodings are a core part of CK Tile which can define the relationship between register and intrinsic matrix fragment elements. They allow for any mapping based on unmerge and merge transformations. Also, they allow for a special "Repeat" dimensions which acts like an additional matrix dimension and allows for replication of certain matrix elements. The new mapping utility can deal with all aspects. ## Test Plan Since this is a debug utility there is nothing to directly test, but there is an example file that defines four different Tile Distribution Encodings and prints their forward and backward mappings, along with some extra parameters. ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-02 09:42:00 -07:00
Bartłomiej Kocot	9ad26943ac	[CK][CK TILE] Improve oob check (#4791) ## Motivation Improve OOB checks. Remove the permute instructions that were generated by the thread buffer zero clear; the assembly now contains only a condmask instead of permute + condmask. Change the number of KPack for generated instances. ## Technical Details Remove permute instructions from assembly. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com>	2026-02-24 22:40:48 +01:00
assistant-librarian[bot]	45c177c384	[CK_TILE] Extend support of mix precision microscaling BQuant (#4267 ) ## Proposed changes Supported types combinations using BQuant=e8m0: - A=bf16 - B=bf16,bf8,fp4 Summary: - remove usage of `pk_fp4_raw_t`: consistent with other implementations and avoid taking into account of the packed size explicitly. In general, the raw type should not be used because CK Tile internally takes care of the PackedSize, so using the raw type adds unnecessary complexity to the implementation - handle microscaling by checking for `e8m0` type for BQuant (previous implementation was inconsistent) - add support for scaling instructions in `DequantPack8` - mx pipeline: - extend existing pipeline to support different B types - add support to scale and cast before writing to LDS or after reading from LDS (this can be defined in the `Problem` by the user) - block gemm: - mx pipeline is now using block gemm BQuant - block gemm BQuant can now load from LDS and apply scale and then call block gemm universal operator. This adds new functionalities and remove code duplication - warp gemm: - add case to support 128bit ds_read/write for both A and B when A=16bit and B=8bit - add examples and tests: note that some tests for bf16/fp4 already existed but were removed during previous tests refactoring. I added them again and other relevant tests for new types combinations ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. 
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689) 🧑‍💻 Originally authored by @EnricoDeg --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-24 09:55:50 -08:00
Anton Gorenko	ce6acc5f66	[CK_TILE][FMHA] Support gfx11 (#4584 ) ## Motivation Add support of gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now build and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage resulting in incorrect dispatching logic, sometimes tests report that there are no instance available for current parameters. * Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when they are not requested (i.e. every time), likely has no effect on performance but makes disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-20 17:15:10 -08:00
Bartłomiej Kocot	4bf06885af	[CK][CK TILE] Add has hot loop check for pipeline v1 (#4407 ) ## Motivation Add has hot loop check for pipeline v1 (v1 basic and v1 basic async). Enable more tests which have been fixed by this change. ## Technical Details Hot loop has been executed without num loop check. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-651 AICK-663	2026-02-11 13:41:59 +00:00
assistant-librarian[bot]	4304c2c38e	[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280 ) ## Proposed changes gemm blockscale eightwarps support ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3650](https://github.com/ROCm/composable_kernel/pull/3650) 🧑‍💻 Originally authored by @kensclin --------- Co-authored-by: KenSCLin <lshyhchy@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-09 11:54:54 +08:00
Emily Martins	2a765fbbad	[CK_TILE] Fix MMA concepts compiler error (#4381 ) ## Motivation CK Tile is required to support certain older OSs; on these OSs, cpp 20 is not fully supported. For ROCm 7.2, compiler errors occur on one of these older OSs. An example of this error is as follows: ```bash /composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments 34 \| { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>; \| ``` The goal of this PR is to resolve these compiler errors. ## Technical Details The existing guards around the mma concepts only check if the concepts language feature is supported, as follows: ```cpp #if defined(__cpp_concepts) && __cpp_concepts >= 201907L // ... template <typename CtrlFlags> concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) { // Flag members for Gfx9 MFMA instructions { CtrlFlags::Cbsz } -> std::convertible_to<int>; { CtrlFlags::Abid } -> std::convertible_to<int>; { CtrlFlags::Blgp } -> std::convertible_to<int>; }; #endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L ``` That said, in cases where functionality from the `<concepts>` header is used (e.g., `std::convertible_to`), this guard fails to check whether the `<concepts>` header is available. This change adds an additional check to the concepts that make use of functionality from the `<concepts>` header to ensure the header is available. ## Test Plan I tested the changes on the relevant docker for gfx90a, gfx950, and gfx942 and the compiler issue is not present. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-06 16:26:57 -08:00
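The guard described in the commit above can be sketched as follows. This is a minimal illustration, assuming the library feature-test macro `__cpp_lib_concepts` is what detects a usable `<concepts>` header; the actual check added in the PR may differ. The names `DEMO_HAS_STD_CONCEPTS`, `DemoCtrlFlags`, `demo_has_cbsz`, and the `Demo*` structs are hypothetical and not part of CK Tile.

```cpp
#include <type_traits>

// <version> (where present) defines library feature-test macros.
#if defined(__has_include)
#if __has_include(<version>)
#include <version>
#endif
#endif

// Require BOTH the language feature and the library header support.
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L && defined(__cpp_lib_concepts)
#include <concepts>
#define DEMO_HAS_STD_CONCEPTS 1
#else
#define DEMO_HAS_STD_CONCEPTS 0
#endif

#if DEMO_HAS_STD_CONCEPTS
// Concept path: safe to name std::convertible_to here.
template <typename CtrlFlags>
concept DemoCtrlFlags = requires {
    { CtrlFlags::Cbsz } -> std::convertible_to<int>;
};
template <typename T>
inline constexpr bool demo_has_cbsz = DemoCtrlFlags<T>;
#else
// Fallback path (hypothetical): the same check as a C++17 type trait.
template <typename T, typename = void>
struct demo_cbsz_trait : std::false_type {};
template <typename T>
struct demo_cbsz_trait<T, std::void_t<decltype(T::Cbsz)>>
    : std::is_convertible<decltype(T::Cbsz), int> {};
template <typename T>
inline constexpr bool demo_has_cbsz = demo_cbsz_trait<T>::value;
#endif

// Hypothetical test types.
struct DemoFlags    { static constexpr int Cbsz = 0; };
struct DemoNoFlags  {};
struct DemoBadFlags { static constexpr const char* Cbsz = "x"; };

static_assert(demo_has_cbsz<DemoFlags>);
static_assert(!demo_has_cbsz<DemoNoFlags>);
static_assert(!demo_has_cbsz<DemoBadFlags>);
```

On a toolchain where the compiler supports concepts but the standard library does not ship `<concepts>`, the second branch is selected and `std::convertible_to` is never named, which avoids the "expected concept name" error quoted in the commit message.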
Jan Patrick Lehr	470f031e58	[Compiler] Addressing new compiler warnings (#3640 ) * [Compiler] Addressing new compiler warnings Clang enables new lifetime warnings in production and we see build errors due to this with the staging compiler. The attributes added in this PR are suggested by the compiler. However, I'm not very familiar with the code base, so the changes may be incorrect. * Update some more instances * Adds file-level ignores via clang diagnostic pragma The number of instances was large, so I decided to use file-level scope to disable the warning via pragma clang diagnostic ignored. It also showed this warning coming from the gtest dependency. For that, I did add the respective command line flag to the CMake variables. I don't know if this is acceptable or not. * This adds the remaining instances For a build on gfx90a. * fix clang format * Adding couple more instances from gfx1200 build * Fixed another few instances --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `069500464d`]	2026-02-02 09:39:48 -08:00
ZheWang	c006b10452	Mx fp6 flatmm (#3601) * add fp6 data-type and support sync/async dwordx3 load/store * clang-format * pre-commit * 1st commit * default mnk pass ut * fix a distribution * fix * fix bdram distr * update * pass ut * improve perf * update * clean code * resolve copilot comment * resolve comment * clang-format --------- Co-authored-by: ZheWang <zhewan@amd.com> [ROCm/composable_kernel commit: `e6bcd192d4`]	2026-02-02 16:04:40 +08:00
jiangyon.ren	ce51308aaf	[CK_TILE][FMHA] Add sparse attention VSA (#3341 ) * add sparse attention VSA * fix the pre-commit * Add jenga test and pre-commit * add bf16 for vsa * add jenga support bf16 * remove lse arg * split kernel code to block & kernel * fix the pre-commit * fix the pre-commit * fix the copyrights * fix the copyright * fix the copyright & rename block to pipeline * fix the copyright and pipeline * remove lse & dropout & add fmt * fix the jenga&VSA code review * remove the useless code & resolved the comments * remove useless code * remove useless code * Clean up code * Remove more unused code * Re-format .hpp * Refactor codegen scripts --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `4d2f8c111e`]	2026-01-31 00:59:47 +08:00
Erwin Terpstra	09d443a7ad	[CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603 ) * chore: split block scale example instances in more separate files to speed up compile times * wip: fp4 scaffolding for abquant * feat: add fp4 decoding-while-loading to abquant pipeline * feat: add support for fp4 CPU verification in abquant * chore: add time tracking to reference calculation * feat: add a4w4 test for blockscale gemm * feat: optimize reference calculation by preconverting values to AccType * feat: add fp4 to fp8 look-up table * fix: reference to wrong ComputeDataType field in QuantProblem * feat: type utilities for determining MFMA compute types * feat: packed fp4 for abquant weight preshuffle * feat: add separate tests for a4w4 base case, padding and preshuffleB * fix: fp4 conversion on gfx950 attempting to use non-supported method * fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size * chore: add fp4 preshuffleb mode to block scale example * chore: sanity check for packed types being 1 byte * chore: clarify tensor dimension indices with constants * chore: replace traits check with specialized check for packed types * style: some minor refactoring and cleanup * fix: correct conversion table for FNUZ fp8 * chore: add fp4 instances to main abquant instances again * chore: use same initialization branch for int4 and fp4 * chore: add missing initialization for fp4 in block scale gemm example --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `6a6177a246`]	2026-01-30 04:40:50 -07:00
Yi DING	bb0986e59e	[CK_TILE] ABQuant New Preshuffle (#3638 ) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `8e3d84aba3`]	2026-01-27 23:46:49 -08:00
Max Podkorytov	8b842250da	Add persistent async input scheduler for GEMM kernels (#3520 ) Add signal-based synchronization for persistent GEMM kernels where input data becomes available incrementally. Uses modulo wraparound (like PyTorch's AsyncMM) for chunk index calculation: chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks Key components: - PersistentAsyncInputScheduler struct with tiles_per_chunk_m, chunk_signals, tile_idx_pivot_m, and num_chunks fields - wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency - IsSupportedArgument validation for scheduler parameters - Example demonstrating async input scheduling with simulated producer - GTest unit tests covering all layout combinations [ROCm/composable_kernel commit: `91b4102a59`]	2026-01-20 10:37:09 -08:00
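The chunk index formula quoted in the commit above can be worked through with a small stand-alone sketch. The function name `chunk_index` and the sample values are hypothetical and chosen only for illustration; only the arithmetic is taken from the commit message, and this is not the CK Tile scheduler itself.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the formula from the commit message:
//   chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
constexpr std::uint32_t chunk_index(std::uint32_t tile_idx,
                                    std::uint32_t tile_idx_pivot,
                                    std::uint32_t tiles_per_chunk,
                                    std::uint32_t num_chunks)
{
    return ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks;
}

// Assumed example: 4 tiles per chunk, 3 chunks, no pivot.
static_assert(chunk_index(0, 0, 4, 3) == 0);  // tiles 0..3 map to chunk 0
static_assert(chunk_index(3, 0, 4, 3) == 0);
static_assert(chunk_index(4, 0, 4, 3) == 1);  // tiles 4..7 map to chunk 1
static_assert(chunk_index(11, 0, 4, 3) == 2); // tiles 8..11 map to chunk 2
static_assert(chunk_index(12, 0, 4, 3) == 0); // 12 / 4 = 3, wraps to chunk 0

// A pivot of 4 shifts every tile forward by one chunk.
static_assert(chunk_index(0, 4, 4, 3) == 1);
static_assert(chunk_index(8, 4, 4, 3) == 0);  // (8 + 4) / 4 = 3, wraps to chunk 0
```

The modulo is what makes the index wrap around, so a tile index past the last chunk maps back onto an earlier chunk.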
Adam Osewski	a9ff38bc89	[CK_BUILDER] Convolution forward transfer concepts. (#3535) * Rename member variable to better reflect its actual meaning. * Add transfer checks for conv fwd xdl. * Validate tensor layouts & vector size conv fwd v3. * Add combined transfer concepts. * Add transfer concepts for conv fwd factories. * Fix clang format * Add helper instruction to get max mem vector instruction width. * Apply review comments. * Rename thread cluster access(->arrange) order concept * Fix merge artifacts. * Add generic access order limits into block transfer concept. [ROCm/composable_kernel commit: `1a6d1b59ef`]	2026-01-19 10:54:10 +01:00
Cong Ma	026c9200ee	[CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512 ) Refactor function amd_buffer_load_invalid_element_return_zero to avoid the inefficient ASM code generated by compiler. Compiler generates suboptimal assembly for ternary operator, causing excessive VGPR usage Tested compilers: - Rocm 7.0.1 - Rocm 7.1.1 Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `d7497d2694`]	2026-01-07 00:05:56 -08:00
joyeamd	00d05ab32e	Merge some updates for ck_tile headers (#3342 ) * fix some issues from internal branch * update cshuffle_epilogue * update cshuffle_epilogue * update cshuffle * update warp_gemm [ROCm/composable_kernel commit: `b78563b3d3`]	2026-01-05 23:39:00 -08:00
Estevan Vedovelli	32e805b853	Add support to gfx1153 and fix gfx115X WMMA config (#3496 ) * Support for gfx115X * Changes for gfx115X * Add gfx1153 * Update changelog --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `1224bc0a82`]	2026-01-05 10:03:30 -08:00
Bartłomiej Kocot	80eaeacea5	Update AMD buffer coherency (#3403) * Update AMD buffer coherency [AICK-421] * fixes * fix * fixes * fixes * Add backward compatibility * fix * fixes * fix * fix * fix * Update grouped_convolution_backward_weight_kernel.hpp [ROCm/composable_kernel commit: `700b2ec9c0`]	2025-12-18 10:16:22 +01:00
ltqin	c8397e8ef2	flashattention fwd add (80, 96) instance (#3415 ) * add hdim (96,96) instance * change to (80,96) * format py * remove 96 in optdim * when N=6 change to llvm_amdgcn_raw_buffer_load_i32x3 [ROCm/composable_kernel commit: `92653168c2`]	2025-12-17 09:16:11 -08:00
eliotwang	d5645ff481	Bf16fp4 gemm (#2801 ) support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * rebase to new develop * fix clang format * update code according to reviewer's comment * Update README.md * update code according to reviewer's comment * update code according to reviewer's comment * Update CMakeLists.txt * Update README.md * Update CMakeLists.txt * Delete files * Delete files * Add unit tests * Update test_gemm_quant_base.hpp * merge bf16fp4 example to develop branch fix clang format * fix clang format * Update CMakeLists.txt * fix ci test * fix clang format * resolve conflicts --------- Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu> Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `715671e419`]	2025-12-11 07:20:29 -08:00
Zzz9990	822da5d3a7	[CK_TILE MOE] add NT & preshuffle permute to cktile MOE (#3377 ) * update coherence --------- Co-authored-by: Zzz9990 <Zzz9990> [ROCm/composable_kernel commit: `1aa93ef551`]	2025-12-10 10:03:28 +08:00
msaffari-amd	4f5a48c910	Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic-add (#3236) * Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic * correct clang-format * removed unused rtol_atol variable from example code * clang format correction * remove unused variable max_accumulated_value from example [ROCm/composable_kernel commit: `f875ab0bbc`]	2025-11-28 09:43:01 +01:00
Aviral Goel	216c23b945	chore(copyright): update copyright header for include directory (#3293 ) [ROCm/composable_kernel commit: `de6466481f`]	2025-11-26 11:00:05 -07:00
Christopher Millette	10eb15416c	First look at mfma / wmma unification (#2704) * First look at mfma / wmma unification * Refactor * Re-org file structure * Restructure transform selection and WaveWiseMma class * Update license files. Add missing gfx1151 support. Change wave size for HOST to 1. Update datatypes naming consistency * Fixes default MmaSelector implementation * Adds unit tests for amdgcn_mma and arch * Consolidate common arch id checks to constexpr functions. Strongly type ids as amdgcn_target_arch_id object. * Refactor is_any_value_of * Fixes mma_selector logic * Fix typo * Add mma selector test for tile decomposition * Fix compilation of mma.hpp * Revert back to c++17 compatibility * Fix compiler error by returning index_t from get_warp_size() * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fixes compiler error for missing is_wave32() function * Fixes compiler error for host wave_size() should be 64 * Fixes compiler errors where __cpp_concepts is not defined * Fixes compiler errors where __cpp_concepts is not defined * Fix test failure for host is wave64 by default --------- Co-authored-by: Chris Millette <you@example.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `b9c6cb1452`]	2025-11-24 09:39:59 -08:00
Bartłomiej Kocot	a2a69e7649	[CK_BUILDER] Add grouped conv fwd ck tile traits (#3183 ) * [CK BUILDER] Add grouped conv fwd ck tile traits * Update instance_traits_tile_grouped_convolution_forward.hpp * Update grouped_convolution_forward_kernel.hpp [ROCm/composable_kernel commit: `92c1f4981a`]	2025-11-11 13:55:33 -08:00
joyeamd	846b43f43b	add gfx11's barrier following SPG's reference (#3159 ) * add gfx11's barrier following SPG's reference * re-format the code * minor fix --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `12922120d2`]	2025-11-05 22:29:03 -08:00
Max Podkorytov	b7a073f769	[CK-tile] unhardcode the number of LDS banks from universal gemm policy (#3130 ) Fixes LDS bank conflicts on gfx950 for universal gemm v3 pipeline Replaces hardcoded LDS layer calculations with dynamic computation using the new architecture helpers Adds architecture-specific helper function get_n_lds_banks() Changes function attributes from CK_TILE_HOST_DEVICE to CK_TILE_DEVICE in universal gemm policy [ROCm/composable_kernel commit: `04efd282cf`]	2025-10-31 11:58:11 -07:00
Anton Gorenko	220bd7a9bb	[CK_TILE] Support WMMA (gfx12) in FMHA (#2528 ) * Pass hdim to tile_example_fmha_fwd in fp8 tests * Add WMMA support to fwd FMHA pipelines * Tune tile sizes a bit for less spilling fp16 256 is still quite slow * Fix Q grad tile distribution for warp size = 32 and hdim >= 256 With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile. * Use code based on BlockDropout in BlockDropoutBwd * Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal * Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9 because they both equal to 8 while on gfx12 they are 8 and 4; * Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1], it worked on gfx9 because 4 * kM == kN1 == 32; * Removing these hidden dependecies allows to support: * any number of warps (power-of-2), not only 4; * kN1 = 16, not only 32; * any number of splits; * Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now * Replace hard-coded kN1 in dispatch code with the requested tile size * Add gfx12-specific tile sizes for split KV * Pass GPU architecture to kernel generation scripts This is still a temporary solution. * Build and run FMHA CI tests for gfx12 * Fix issue after merging * Fix bwd tile sizes The current pipelines always read only one tile K and V tile, this requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and kK2 == kVHeaddim). * Use hardware f32->f8 on gfx12, remove v_perm __builtin_amdgcn_perm is not needed because __builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of 32-bit dword) is used to store results (two f8 values). 
* Update changelog * Add WMMA support to pagedkv * Fix scripts after rebasing * Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout Add comments with dropout implementation details Fix performance regression of fwd+dropout * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox; * "scalarize" seed and offset, they may come either from kernel args or from device memory (presumably loaded with vector loads). These changes help the compiler to procude more optimal code and reduce register spilling. Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding Use code based on BlockDropout in BlockDropoutBwd Refactor BlockDropout (fwd) Implement BlockDropout (fwd) for WMMA Originally BlockDropout only supported 32x32 tiles (IsWG32 = true), this version supports 16x16 tiles. If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly to BlockDropoutBwd. Implement BlockDropoutBwd for WMMA Remove MakeRandValLds* functions unused in BlockDropoutBwd Remove unused Run overload from BlockDropoutBwd * Fix regression with philox seed and offset when they exceed 32-bit int __builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset are 64-bit so they get truncated. * Fix names after cherry-picking * Fix selection of a fallback tile based on bm0 The assumption that the largest bm0 == 128 is not always true for current fp32 tiles. * Do not use filters related to qr_async_trload They disable tiles/pipelines which are valid for gfx12. * Use different dstr encoding when C is transposed * Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code Some WarpGemmDispatcher instantiations are defined only for specific archs and undefined on host. Calculations related to sched barriers are moved from Pipeline's public fields into pipeline's operator(). 
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used in codegen scripts are 32, 32, 32. * Generalize usages of WarpGemmDispatcher for MFMA and WMMA WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still used explicitly becaus of swizzle factor = 4. * Mark has_load_tr as maybe_unused There are no transpose loading for RDNA. * Remove CK_TILE_USE_MFMA/WMMA from fmha-related code * Detect BlockSize on host based on warp size of the current device If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with kBlockSize / 2 because on host get_warp_size() == 64 always. * Fix calculation of grid size for combine kernel with warp size = 32 * Add missing includes and header * Support multiple archs in one binary for fwd * Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill * Support multiple archs in one binary for bwd * trload kernels are compiled only for gfx950; * instances with padding are checked after instances without padding so they can be used as fallbacks (similarly to fwd); * Extract common code from register_traits * Revert "Fix regression with philox seed and offset when they exceed 32-bit int" To simplify merging , the proper fix is in develop already. * Support new numerical d paddings in trait ordering checks * Build fp32 tests only on gfx9 * Do not use hardcoded M0 = 64 for dot bwd kernel * Use textwrap.indent from standard library * Make fp8 pipelines on gfx12 consistent with gfx9 * Update tests for current pipelines * Make ninja check more responsive in CI ninja buffers output so this job looks hanging. * Support fp8fp32 by limiting O vector size The fp32 output type requires storing 8 * sizeof(float) = 32 bytes, which is not implemented (here 8 is the number of C values per lane for v_wmma_f32_16x16x16...). 
* Remove unused cmake options * Unify including amd_buffer_addressing.hpp/_builtins.hpp * Temporarily use amd_buffer_addressing.hpp on >=gfx10 amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores which is not compatible with >=gfx10: * 1 scalar for exec masks instead of 2, * gfx12 uses different instruction names etc. * Update asm in bf16 conversions to work with warp 32 * Do not generate splitkv/appendkv with vlayout=col for consistency with fwd * Add arch tags to kernels/host funcs, compile for each arch separately * Add kM0 to fmha_bwd_dot_do_o kernel name to match filename * Add workaround for miscompilation of bwd with padded hdim SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent branches used to store padded tensors (when some lanes are inactive due to padding). Inline asm with dummy dependencies on VGPRs of the tensors prevents the compiler doing this. * Fix add_gtest_executable for absolute paths Some tests (like gemm_tile_engine) pass absolute paths to source files. In CI the branch name is a part of the root dir, and if the branch name contains "wmma", "xdl" etc., files can be incorrectly excluded. * Run only hdim 128 smoke tests for fp8fp32 There are no instances for hdim 64 and 256. * Format py with ruff to simplify merging develop * Fix incorrect var name * Codegen for gfx9,gfx950 when --targets is not specified Aiter and Pytorch require changes for passing their targets to the codegen scripts. With this temporary solution the files are generated but not all of them have to be really built (depending on the used --offload-arch=). * Combine arch-related values into ArchTrait This more centralized approach removes duplication of various formatting templates. * Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large" Some code is extracted into a function. [ROCm/composable_kernel commit: `1e77695fe8`]	2025-10-29 13:31:08 -07:00
lalala-sh	0329d71fb9	[CK_TILE] Update flatmm related kernels (#3022 ) --------- Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `211d64e18a`]	2025-10-22 22:36:11 +08:00
Gino Lu	182c4404b5	[CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044 ) * Patch for pk_fp4_raw_t buffer load and ref check [ROCm/composable_kernel commit: `fb1d090f3c`]	2025-10-20 14:47:04 +08:00
joyeamd	ed83bcb9a2	update s_barrier's logic in gfx12 architecture (#3003 ) change s_waitcnt's logic in gfx1250 change s_waitcnt's logic in gfx1250 update comment [ROCm/composable_kernel commit: `b9d74e7746`]	2025-10-14 08:49:34 -07:00
Max Podkorytov	0f7644177c	Add a new gemm pipeline based on ComputeV4 which utilizes async copy API (#2949 ) * check in pipeline and policy for async load in mi350, need to make sure TileAccessPattern is warp_raked or block_raked solve merge conflicts * fix cmakelists * make it build * fix? buffer async fence * relax fences; it appears it only is needed between pairs of ping-pongs * remove fences * remove fences * cleanup and reformat * add steps annotations * comment all pipeline steps / remove unexplainable syncs * clang-format * add comment * cleanup kernel types for test * fix comment * fix hardcoded warp size * faithfully copy block gemm from compute v4 policy to async policy * make async test gfx950 only * fix cmake logic * set separate compile options for async * refine comment in policy * try update hotloop scheduler * cleanup comments * test more K block sizes * unhardcode Ks, sort of * add large odd test case * fix build for quant * add comment to hot loop scheduler and rename enum * reformat * reword the pipeline description * reformat * address review / add static asserts / typo fix * update changelog [ROCm/composable_kernel commit: `a7da3c68b9`]	2025-10-01 15:38:07 -07:00
Sami Remes	18f4a0728b	Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe (#2893 ) * Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe * also do the same for amd_buffer_addressing_builtins.hpp * merge with develop * fix clang format --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `ef43078788`]	2025-09-30 15:12:30 -07:00
Anton Gorenko	bc9362af55	[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836 ) * Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout Add comments with dropout implementation details Fix performance regression of fwd+dropout * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox; * "scalarize" seed and offset, they may come either from kernel args or from device memory (presumably loaded with vector loads). These changes help the compiler to procude more optimal code and reduce register spilling. Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding Use code based on BlockDropout in BlockDropoutBwd Refactor BlockDropout (fwd) Implement BlockDropout (fwd) for WMMA Originally BlockDropout only supported 32x32 tiles (IsWG32 = true), this version supports 16x16 tiles. If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly to BlockDropoutBwd. Implement BlockDropoutBwd for WMMA Remove MakeRandValLds* functions unused in BlockDropoutBwd Remove unused Run overload from BlockDropoutBwd * Fix regression with philox seed and offset when they exceed 32-bit int __builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset are 64-bit so they get truncated. * Add F32 MFMA warp gemms * Support f32 in fwd FMHA * Implement transpose_vectors for 4-byte types (float) * Fix unexpected implicit f32->uint32 cast in buffer_store<4> __builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint). mbuf_t types in other buffer_store<> are changed for consistency. 
* Support F32 in bwd FMHA hdim = 256 is disabled for now because it uses too much memory on gfx90a * Support Headdim = 48 (divisible by 16) in fwd * Add fp32-specific receipts (800 and 801) * Tune fwd tiles * Tune bwd tiles * Use small tiles only for small seqlen_q * Fix after rebasing * Fix selection of a fallback tile based on bm0 The assumption that the largest bm0 == 128 is not always true for current fp32 tiles. * Remove constraints and adjust filtering for fp32 Custom constraints are no longer needed because now the smallest tile is selected automtically based on seqlen_q. Filters related to qr_async_trload disabled valid fp32 tiles. * Add fp32 tests * Make splitkv and appendkv compile for fp32 only There are no instances yet, but API still must compile when only fp32 is requested. * Remove unimportant f32 instances * Add test_ck_tile_fmha__fp32 to REGRESSION_TESTS Replace magic numbers with a constant, improve comments for dropout * Update changelog * Fix condition that dq_acc must be set to zero when mask is used The change was introduced in #2799 * Replace warp_uniform with recently added amd_wave_read_first_lane * Add hdim = 96 and 192 to fwd [ROCm/composable_kernel commit: `1edd250115`]	2025-09-27 18:03:48 +05:00
Khushbu Agarwal	9ed178a93e	Fix for Add the API to load SGPR (#2913 ) * Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)" This reverts commit 4c78cc31c5b8e0c9db09c24fa35393f603a8a47f. * Fix: sgpr minor issue * cyclic dependency resolved * clang formatted * removing unused variable * clang formatted --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b56e5d1d79`]	2025-09-25 10:32:42 -07:00
Sami Remes	c5a3d4c765	[CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903 ) * Fix issue with constexpr checks in scaling/cshuffle * Remove IsLoadableTile * Move amd_wave_read_first_lane before first usage [ROCm/composable_kernel commit: `dcd33a6ecc`]	2025-09-23 23:08:18 -07:00
asleepzzz	651a5dd0b9	Revert "[CK-Tile] Add the API to load SGPR (#2878 )" (#2904 ) This reverts commit 3e008a2d22ad1ba8a9b2c7eca369a8593b7d6e95. [ROCm/composable_kernel commit: `f161b5b738`]	2025-09-23 14:33:51 -07:00
Thomas Ning	e3702467d5	[CK-Tile] Add the API to load SGPR (#2878 ) * Have a workable version for SGPR * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * substitute with the new sgpr read api * update the CHANGELOG * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * change to static for logic * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. [ROCm/composable_kernel commit: `2cbbf5dcb3`]	2025-09-23 01:23:56 -07:00
linqunAMD	a9e6cb0ec0	Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725 ) * Enable xdl in gfx11 & gfx12 * update cmake file * fix all instance build (cmake) * fix batched_gemm_gemm(cmake) * rebase cmake files * fix cmake build error * remove CK_ENABLE_DYNAMIC_WARP_SIZE * update cmake build error2 * fix gfx11 build CK_USE_XDL is enabled on gfx11 and gfx12 * fix gfx10 build * fix gfx11 error --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `f22740df82`]	2025-09-15 10:59:25 -07:00
Thomas Ning	82cd975698	Fix the vector load & fix the gfx950 compv4 error (#2831 ) [ROCm/composable_kernel commit: `1894a0dbc3`]	2025-09-12 11:48:45 -07:00
Illia Silin	84ee8c0a7e	Revert "add vector load 16/32 for bf16/fp16 (#2779 )" (#2818 ) This reverts commit e04d4c6fcb124b6c00af9f7378fb672e7f0ef87d. [ROCm/composable_kernel commit: `b4207c01c7`]	2025-09-10 13:35:15 -07:00
zjing14	72dcde393a	add vector load 16/32 for bf16/fp16 (#2779 ) * add vector load 16/32 for bf16/fp16 * comment addressed * clang format --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `7ecdba878f`]	2025-09-09 23:15:19 -07:00
Yi DING	a29eac5da0	[CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715 ) * FMHA avoid unnecessary vmcnt0 Squashed commit of the following: commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:51 2025 +0000 merge develop and solve conflicts commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c Merge: 2dac61a4f 13a6816fb Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:21 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140 Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 02:17:43 2025 +0000 update bwd commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682 Author: Kevin Choi <kevin.choi@amd.com> Date: Mon Aug 18 19:36:38 2025 +0000 add restrict to applicable functions commit 45534dee5bcbe532da46fc5cd6601cde10d84387 Author: Ding, Yi <yi.ding@amd.com> Date: Mon Aug 18 02:07:03 2025 +0000 bwd filter commit 7abd7b372b82cba94a457238b6b4a81d093e7280 Author: Kevin Choi <kevin.choi@amd.com> Date: Sat Aug 16 08:15:23 2025 +0000 remove noinline attr as it causes a lot more s_waitcnt's commit 89c29746a09255c1d26038171157e91d1b68d14a Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 12:11:17 2025 +0000 remove innerloop, move restrict parameters to mainloop and add noinline attribute. 
commit 6f61b3a5c80011411aa3aebf7983602f7c117566 Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 07:06:51 2025 +0000 Create inner lambda with restrict parameters, add restrict to some parameters commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a Author: aska-0096 <haocwang@amd.com> Date: Thu Aug 14 03:43:54 2025 +0000 save for debug commit 5f2c3cfa86c6951208a1cc227fa556704a885a88 Merge: 25f067b4f 165a2723c Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:15:22 2025 +0000 Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 25f067b4f09d6909a05e252c7621124046dfda57 Merge: 447c1c5d6 2ad2f97b7 Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:14:26 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 165a2723c557420b48891cc1ce3434e3675aef5d Merge: 447c1c5d6 4491739ab Author: asleepzzz <hanwen.chang@amd.com> Date: Wed Aug 13 00:34:11 2025 +0800 Merge branch 'develop' into wip-async-tr-fa commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 14:25:50 2025 +0000 refactor blockgemm change, isolate to v2; commit 8f67083511ff77d31c880f4427d3bdf53a179568 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:26:13 2025 +0000 clang format commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965 Merge: c84662c34 245071bcf Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:04:41 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit c84662c345755ec5f3d524fdde4aa951c8f86298 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 08:46:06 2025 +0000 Fix the bug commit e0647ffa5646f8132529b152af02750c4010013d Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 04:02:41 2025 +0000 fix conflict. 
disable all v-col instance for fmha fwd commit 781f98236c376f57591a6d481cc2ee04b36a148b Merge: 241f3d7dc 6e03d9607 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 03:52:34 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:53:31 2025 +0000 clang format commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f Merge: 1a629c098 3639befe9 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:52:52 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 1a629c09876cc05f0750db7eade1d527dc32a1d3 Merge: f65874e5b b34a029cd Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:59:40 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit f65874e5b07579d5b734b4c68877679a3ee04dac Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:37:37 2025 +0000 change the warp setting for hdim32 fmha fwd commit 7c5f5e65e97486c074ef9a138900ed9aafea547e Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 14:21:09 2025 +0000 tempsave, update the blocksync functions commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915 Author: aska-0096 <haocwang@amd.com> Date: Sun Aug 10 06:00:51 2025 +0000 fix bug in pki4 commit 073db2e18af21f1ed1fb3d1f1c15830838df986f Author: aska-0096 <haocwang@amd.com> Date: Sat Aug 9 03:25:12 2025 +0000 fix bugs in gemm commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 18:35:53 2025 +0000 fix bug on non-gfx950 commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 17:53:19 2025 +0000 fix bug commit 30de97f473685e0bd5b82f15eee2493d9a05cffd Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 15:42:15 2025 +0000 fix bugs commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 
09:31:01 2025 +0000 fix clangformat with 18.1.3 commit e4cb185c41586d018771a5413efd909d8d53a8c5 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:07:40 2025 +0000 remove non-necessary change commit 498f0d44cfba17287cce8d10855cce5c5de263db Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:04:02 2025 +0000 bug fix, clang format; commit 3cb648cbc4883e6889340d85f48d803a21b9c805 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 08:08:03 2025 +0000 Remove unnecessary changes commit 9e7ff3b611b7933b65973907a0cae312a15d31c6 Merge: a3c1bfe6d 7f14bd1df Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 07:50:12 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 06:19:31 2025 +0000 remove unnecessary files; rename some files commit 6c257fa27729c005d539b5b71deeba3703031089 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 05:46:18 2025 +0000 merge fa_decode pipeline into fmha_fwd api commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449 Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 6 05:58:43 2025 +0000 add __restrict__ to tr load commit bbad2b979b701533b74f43452ffe0f775e019139 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 5 07:23:51 2025 +0000 Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug commit d7fabd5f765e2a573ddbaf0857ce6f691407e562 Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:27:42 2025 +0000 Add v_permlaneb32 for block_reduce. 
Disable it as it will cause un-coexecutable packed math in FA commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9 Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:02:17 2025 +0000 add vmcnt guard before load ktile commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 06:49:01 2025 +0000 Load Q through lds, implement xor; commit 62bb9f05177dfb8280d6c2be67a88492d6be4838 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 1 10:44:54 2025 +0000 small refactor commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 10:25:37 2025 +0000 upgrade prefill pipeline; simple iglp; consistent data produce and consume order commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 05:13:27 2025 +0000 enable larger tile size; upgrade xor pattern commit a468e59a01d6dd85c105ca30ac249491256c5915 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 12:25:33 2025 +0000 remove all lds bankconflict with xor layouts commit 39ff55cdc377311112100fb24bc013adfd8960c0 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 03:51:06 2025 +0000 enable prefill overload operator(). commit a7b152a788e8035c93f8e4cbf317863182665d8f Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 25 07:10:01 2025 +0000 fix the lds alignment caused performance regression commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 23 09:05:57 2025 +0000 remove unnecessary features commit 9758750801c7fd5a80f654eb982f43b87d674fa3 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 22 08:04:05 2025 +0000 tempsave. 
asynccopy+trload sanity checked commit 1c4c04d725047357224ebf8a2b94d9010a5651a6 Author: aska-0096 <haocwang@amd.com> Date: Mon Jul 21 05:55:55 2025 +0000 tempsave, trload+asyncload done commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 10:04:34 2025 +0000 compile pass commit d41b5eace939909084d32281710fb81142ad5fec Merge: 3f86a81ee 33204e15f Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:17:27 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 3f86a81eee75256a78df02032d50814aaa42b038 Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:16:39 2025 +0000 tempsave commit 7d43f7446a9a20773f70e08462393f6c9afb7280 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 10:06:09 2025 +0000 temp save commit 727629cd9115f1be9c1800bb65a8ea84ff06c250 Merge: aa5da19c9 94bceebc9 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 07:24:32 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline commit 94bceebc96ef4885e0ac861b7793e2e2897481bd Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 03:10:46 2025 +0000 move test_copy into test commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:41:31 2025 +0000 remove unnecessary output commit b1dbcacb1832560c6cc967a079dffce558228f0b Merge: 5b0d311e6 0eaf3325a Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:13 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix commit 5b0d311e649257557a7014c28fcfac0c327b77b5 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:10 2025 +0000 add input validation and bug fix commit 0eaf3325a8e019402ff12a2402f446f8471f584f Merge: a66e1d29a 08c5df68a Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Date: Wed Jul 16 11:23:57 2025 -0700 Merge branch 'develop' into test_copy_fix commit 
a66e1d29a8cccc17cc8958d970ec7b1281ec8291 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:55:50 2025 +0000 fix vmcnt shift commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:37:07 2025 +0000 Improve s_waitcnt_imm calculation commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 05:39:50 2025 +0000 fix the s_waitcnt_imm calculation commit 1c0870089a0e7c78ed71a278bf52d98fc780e482 Merge: d6ee05e36 d9de58c66 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:57:57 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:54:33 2025 +0000 Add block_sync_lds_direct_load utility commit c037a72040217471f52ee76bed9c07bf5b22aef4 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 15 09:39:03 2025 +0000 fix async copytest bug commit aa5da19c94022449b027e7a57668f2e219f0f171 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 10 04:29:33 2025 +0000 temp save, change all instance to 1wave commit ddd172feb9eb2cb783420a8db6f44d51b350c370 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 8 08:37:20 2025 +0000 tempsave, fmha_decode commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33 Author: aska-0096 <haocwang@amd.com> Date: Sat Jun 21 15:02:57 2025 +0000 temp save, waiting for debug commit 71dd31f15bca01995c8cb0be9e903103f4657181 Author: aska-0096 <haocwang@amd.com> Date: Thu Jun 19 05:11:52 2025 +0000 save an example for __bf16 type commit cdf33e079fa7d7d5b03b06550df2356b02041d7b Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 07:27:24 2025 +0000 fix bwd code commit d630998dc6751f44097b1e9a239bb5063a793736 Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 06:37:16 2025 +0000 Fix for fwd/bwd kernel build filter commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 08:13:18 
2025 +0000 Add restrict to avoid unnecessary vmcnt --------- Co-authored-by: aska-0096 <haocwang@amd.com> * Add comments for C-style cast * Better comments --------- Co-authored-by: aska-0096 <haocwang@amd.com> [ROCm/composable_kernel commit: `de61e55493`]	2025-08-25 20:55:12 +08:00
Po Yen Chen	3c57e29cdc	[CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values (#2669 ) * Allow return VGPR get_warp_id() value * Avoid using SALU in async_load_raw() [ROCm/composable_kernel commit: `0db21053e6`]	2025-08-22 10:17:05 +08:00
Po Yen Chen	a2b87dd39d	[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549 ) * Support async load dwordx4 * Enlarge load size on gfx950 [ROCm/composable_kernel commit: `4a7ecce096`]	2025-08-22 10:13:47 +08:00

95 Commits