composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
Enrico Degregori	2733e75900	[rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e) [CK Tile] Async support pipeline V3 ## Motivation Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async pipeline) ## Technical Details - Add `Async` bool to `Problem` struct to enable async pipeline in existing one - Add `static_move_ys` to load transpose. This generates offset in assembly instructions saving registers - Add `is_valid` to `async_get_vectorized_elements`. Before hard coded to true. It allows to support padding - Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr` (wider use of lds load transpose on gfx950) - Integrate async support in existing V3 pipeline (avoid pipelines duplication) - Create policy to support both async and default cases. This could be used by any async pipeline (next steps) - Define `wg_attr_num_access` separately for A and B. This allows to optimize ds_read instruction width for cases when one matrix is transposed and the other is not. Before in such cases, `ds_read_b64` was used instead of `ds_read_b128` - Add test for V3 async. Currently only supporting cases with A and B having the same type ## Test Plan New test `test_ck_tile_gemm_pipeline_compv3_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 06:57:14 +00:00
Yung-sheng Tu	e826b2eb7e	[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f) [CK TILE] Unification Work – Add MFMA specialisations for `tf32_t` (#6768) ## Motivation This PR adds two specialisations related to `tf32_t`. ## Technical Details This change treats `tf32_t` as a concrete type rather than an empty `struct`. It also adds two new specialisations for MFMA dense builtins and resolves existing circular include issues. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 12:27:41 +00:00
Enrico Degregori	7b9245f18c	[rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d) [CK Tile] Async support preshuffle GEMM ## Motivation Add async support to existing preshuffle GEMM pipeline ## Technical Details Notes: the implementation avoids previous strategy of duplicating pipelines for async support and instead add a switch `Async` to the ops Problem to enable async pipeline. Then, integrate the async pipeline in the existing one. This allows to avoid code duplication and facilitate the integration of buffer load to lds in existing pipelines. In my opinion, it should be used also for other pipelines which don't support buffer load to lds yet and it would also be a good idea to refactor the existing async GEMM pipelines with the same approach. Summary: - integrate buffer load to lds in existing pipeline - add optimal tensor descriptors for vmem loading and lds reading. They are currently optimized for 16x16 wave tiles but they also work for 32x32 wave tiles. Optimizations for 32x32 wave tile requires different lds layout and it will be done in a follow-up issue - Add async config to examples - Add test (gfx950 only) ## Test Plan New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 07:17:09 +00:00
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
JP-Fernando	e7798e9560	[rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac) Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112) ## Motivation The goal of this work is to apply XOR shuffle (swizzle) to the current `comp_async` GEMM pipeline and the `gemm_mx` pipeline. XOR swizzling has been helpful to avoid LDS bank conflicts, as data are redistributed across LDS banks, such that simultaneous threads accessing different rows land on different LDS banks. ## Technical Details A similar approach to the work in the existing eight-waves pipeline was followed. Currently, XOR swizzle support is available for FP8 and BF8 types. FP4 support is also available for MX GEMM. Should the types not match, or should the async vector width be of an unsupported size, then the pipeline falls through to the previously existing ('unswizzled') path. ## Test Plan Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM pipeline. Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for the MX GEMM pipeline. ## Test Result The tests passed successfully in the `Alola` cluster with MI350 hardware. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-05-21 09:36:41 +02:00
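The XOR swizzle idea described in the commit above can be illustrated with a minimal, generic sketch. This is not the layout used by the CK Tile pipelines: the function names, the power-of-two assumption on `num_cols`, and the column-granular model of "banks" are all assumptions made for illustration.

```cpp
#include <cstdint>

// Illustrative only: a generic XOR swizzle, not CK Tile's actual LDS layout.
// Assumption: `num_cols` is a power of two.
// The logical column is XOR-ed with the low bits of the row index, so the
// same logical column lands on a different physical column in each row.
constexpr std::uint32_t xor_swizzle_col(std::uint32_t row,
                                        std::uint32_t col,
                                        std::uint32_t num_cols)
{
    return col ^ (row & (num_cols - 1u));
}

// Hypothetical check: for one logical column, `num_cols` consecutive rows
// map to pairwise distinct physical columns (assumes num_cols <= 32 so the
// visited set fits in a 32-bit mask).
constexpr bool column_is_conflict_free(std::uint32_t col, std::uint32_t num_cols)
{
    std::uint32_t seen = 0;
    for(std::uint32_t row = 0; row < num_cols; ++row)
    {
        const std::uint32_t bit = 1u << xor_swizzle_col(row, col, num_cols);
        if(seen & bit)
        {
            return false; // two rows hit the same physical column
        }
        seen |= bit;
    }
    return true;
}
```

Without the swizzle, threads reading the same logical column from different rows would all hit the same physical column; with it, each row is shifted to a different one. XOR with a fixed value is a bijection, so each row still holds every column exactly once.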
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Illia Silin	d16061f578	[rocm-libraries] ROCm/rocm-libraries#6550 (commit c396de9) [CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
msaffari-amd	cf517ec050	[rocm-libraries] ROCm/rocm-libraries#5863 (commit 31d9247) [CK_TILE] Separate PermuteN epilogue from CShuffle epilogue into standalone file (#5863) ## Motivation The PermuteN epilogue was previously embedded within cshuffle_epilogue.hpp, despite having fundamentally different behaviour. Coupling these two independent strategies in one file introduced unnecessary complexity, SFINAE guards, and a dual operator() overload selected at compile time via the TiledMMAPermuteN_ template parameter. This PR separates PermuteN into its own standalone file (permuten_epilogue.hpp), simplifying both implementations and making the codebase easier to maintain and extend independently. ## Technical Details New file: permuten_epilogue.hpp: contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from the permuteN code path in cshuffle_epilogue.hpp. Cleanup of cshuffle_epilogue.hpp: - Removed the TiledMMAPermuteN_ template parameter from `CShuffleEpilogueProblem` - Removed the SFINAE-guarded permuteN operator() overload - Removed the EnablePermuateN_ SFINAE alias - CShuffle now only contains CShuffle logic; EightWave support (independent feature) is retained Consumer migration: All consumer files now use compile-time epilogue selection via `std::conditional_t`: `using GemmEpilogue = std::conditional_t< TiledMMAPermuteN, PermuteNEpilogue<PermuteNEpilogueProblem<...>>, CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;` Files modified: - flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp, mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp (flatmm examples) - run_gemm_quant_example.inc (block-scale GEMM example) - gemm_weight_preshuffle_invoker.hpp (weight preshuffle invoker) - test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp, test_gemm_pipeline_util.hpp (test utilities) - universal_gemm_invoker.hpp (universal GEMM invoker) - epilogue.hpp (header updated to include permuten_epilogue.hpp) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-04-14 20:22:18 +00:00
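The compile-time epilogue selection described in the commit above can be sketched with stand-in types. Everything suffixed `Demo` here is hypothetical: the real `PermuteNEpilogue` and `CShuffleEpilogue` templates take problem types whose parameters the commit message elides, so this only shows the `std::conditional_t` pattern itself.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the real CK Tile problem/epilogue templates.
struct ProblemDemo
{
};

template <typename Problem>
struct PermuteNEpilogueDemo
{
    static constexpr bool is_permute_n = true;
};

template <typename Problem>
struct CShuffleEpilogueDemo
{
    static constexpr bool is_permute_n = false;
};

// The consumer picks one epilogue type at compile time from a boolean
// constant, instead of one class carrying both code paths behind SFINAE.
template <bool TiledMMAPermuteN>
using GemmEpilogueDemo = std::conditional_t<TiledMMAPermuteN,
                                            PermuteNEpilogueDemo<ProblemDemo>,
                                            CShuffleEpilogueDemo<ProblemDemo>>;
```

With this shape, each epilogue class can keep a single `operator()` and the choice lives in the consumer. That matches the motivation stated in the commit, though the actual consumer code may differ in detail.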
yinglu	2c91dcb114	[rocm-libraries] ROCm/rocm-libraries#5612 (commit 38c9498) [CK]fix: remove redundant structured sparsity check in run_gemm_example.inc (#5612) ## Motivation This issue was found via https://github.com/ROCm/rocm-libraries/pull/4302#discussion_r2958603418 and was introduced via https://github.com/ROCm/rocm-libraries/pull/5323. ## Technical Details The outer `if` and inner `if constexpr` both checked GemmConfig::UseStructuredSparsity. Merged into a single `if constexpr` since both preshuffle and UseStructuredSparsity are compile-time constants. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 09:22:14 +01:00
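The cleanup described in the commit above, folding a runtime `if` and a nested `if constexpr` into one `if constexpr`, can be sketched as follows. The config struct names, the `Preshuffle` member, and the exact condition are assumptions for illustration; the real condition in run_gemm_example.inc is not shown in the commit message.

```cpp
// Hypothetical configs; the real GemmConfig types carry many more fields.
struct SparseConfigDemo
{
    static constexpr bool Preshuffle            = false;
    static constexpr bool UseStructuredSparsity = true;
};

struct DenseConfigDemo
{
    static constexpr bool Preshuffle            = false;
    static constexpr bool UseStructuredSparsity = false;
};

// Both flags are compile-time constants, so a single `if constexpr` is
// enough: the untaken branch is discarded at compile time and no runtime
// check on the same flag remains.
template <typename GemmConfig>
constexpr int select_path_demo()
{
    if constexpr(!GemmConfig::Preshuffle && GemmConfig::UseStructuredSparsity)
    {
        return 1; // structured-sparsity handling would go here
    }
    else
    {
        return 0; // default path
    }
}
```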
assistant-librarian[bot]	544a3182c1	[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a) [CK_TILE] add tf32 support (#4302) ## Proposed changes TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion --- 🔁 Imported from [ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538) 🧑‍💻 Originally authored by @yingluAMD --------- Co-authored-by: yingluAMD <Yingmao.Lu@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-19 10:17:20 +01:00
Thomas Ning	1ab29bf22f	[rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e) CK Tile MX GEMM Packing Improvement (#5323) ## Motivation Reduce the scale loading size and also has better utilization of MFMA scale selection. ## Technical Details Add up the packing of mx scales. ## Test Plan Use the existing test cases. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: Enrico Degregori <enrico@streamhpc.com>	2026-03-17 11:57:32 -07:00
Sami Remes	c1525b3f30	[rocm-libraries] ROCm/rocm-libraries#4594 (commit 1fce4cb) [CK_TILE] MX GEMM non-preshuffled RCR layout (#4594) ## Motivation Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled layouts using async pipeline. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2026-03-10 20:12:05 +00:00
Max Podkorytov	91b4102a59	Add persistent async input scheduler for GEMM kernels (#3520 ) Add signal-based synchronization for persistent GEMM kernels where input data becomes available incrementally. Uses modulo wraparound (like PyTorch's AsyncMM) for chunk index calculation: chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks Key components: - PersistentAsyncInputScheduler struct with tiles_per_chunk_m, chunk_signals, tile_idx_pivot_m, and num_chunks fields - wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency - IsSupportedArgument validation for scheduler parameters - Example demonstrating async input scheduling with simulated producer - GTest unit tests covering all layout combinations	2026-01-20 10:37:09 -08:00
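The chunk index formula quoted in the commit above can be written out as a small helper. The function name and the use of 32-bit unsigned integers are assumptions; only the arithmetic comes from the commit message.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the formula from the commit message:
//   chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
// The pivot shifts where tile numbering starts, the integer division groups
// tiles into chunks, and the modulo wraps the result back into
// [0, num_chunks).
constexpr std::uint32_t chunk_index_demo(std::uint32_t tile_idx,
                                         std::uint32_t tile_idx_pivot,
                                         std::uint32_t tiles_per_chunk,
                                         std::uint32_t num_chunks)
{
    return ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks;
}
```

For example, with 4 tiles per chunk, 3 chunks and a pivot of 2, tile 7 maps to chunk (9 / 4) % 3 = 2, and tile 10 wraps around to chunk (12 / 4) % 3 = 0.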
Max Podkorytov	e339101e9c	[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 ) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder * hacky poc * fix tile engine * kill the lambda * maybe fix test build * more fixes * clang-format * save temp * clang-format * mostly fix examples * clang-format * remove dead code * more cleanup * fix fmha bwd build (default epilogue set/add appears to be broken) * fix default epilogue tests but not correctness * clang-format * fix bquant * clang-format * cleanup dead code * rearrange make windows for readability * restore changes to IsSupportedArgument * fix smoke-builder * clang-format * fixup rename class * build fixes * clang-format * fix builder * fixup * remove set from builder tests * fix test * clang-format * re-refactor the kernels * clang-format * fix header license * remove memory operation from conv bwd test * clang-format * clang-format example,include * clang-format test * build fixes * clang-format * solve compilation error * fix the CI * solve compilation error * clang format * solve merge conflict * solve merge conflict * solve the gfx11 error * solve test error * moar build fixes * remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope 
--------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-04 03:28:14 -08:00
Yashvardhan Agarwal	ea10a78203	[ck_tile] refactor reduce kernel (#3257 ) * refactor reduce kernel - Rename Reduce kernel as per convention - Move kept_dim and reduce_dims from runtime to compile-time parameters - Update Reduce2dProblem template to include KeptDim, ReduceDims, and Rank - Remove IsSupportedArgument validation function as it's unnecessary. Not using the GuaranteedLastDimensionVectorStride while making tensor view or descriptor which removes the bounds enforced earlier. We still calculate and use vector size. - Update reduce example to demonstrate NCHW->NHW reduction with non-contiguous support - Update tests Kernel now handles both contiguous and non-contiguous memory layout. * fix compile errors	2025-12-17 21:46:08 +02:00
linqunAMD	6d7299ff78	[ck_tile] remove duplicate functions in ck_tile (#3311 ) * [ck_tile] remove duplicated shuffle_b and shuffle_b_permuteN * [ck_tile] move get_k_warp to gemm_shape * resolve code rebase error	2025-12-15 07:13:00 -08:00
Bartłomiej Kocot	3b773109e5	[CK TILE][AICK-439] Fix cshuffle epilogue wave per shuffle (#3364 ) * [CK TILE] Fix cshuffle epilogue wave per shuffle * Align shuffle per tile with smem * fixes * Fixes for double smem * fix	2025-12-15 12:59:48 +01:00
linqunAMD	fc7bf0ab1c	[CK_TILE] Port hw independent changes from internal repo to develop branch (#3301 ) * [CK_TILE] Port hw independent changes from internal repo to develop branch It includes PR#96, #114, #120, #121. * correct rebase error	2025-12-12 09:28:37 -08:00
Max Podkorytov	d184eed823	[CK-Tile] Refactor base pipeline usage (#3251 ) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder	2025-12-04 11:45:49 -08:00
Aviral Goel	004784ef98	chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 ) * chore(copyright) update library wide CMakeLists.txt files copyright header template * Fix build --------- Co-authored-by: Sami Remes <samremes@amd.com>	2025-11-28 13:49:54 -08:00
arai713	24d88d2472	[CK_TILE] Move DataTypeTraits into a Common File (#3146 ) This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile. Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2025-11-27 09:09:54 -08:00
Max Podkorytov	79aae7c7f7	[CK Tile] enable building examples by default (#3259 ) * remove EXCLUDE_FROM_ALL from ck-tile examples -> +15 min build time w/ 64 threads for a single arch * fix cpp17 compile error in the ck-tile examples --------- Co-authored-by: khuagarw <khuagarw@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-11-26 16:24:44 -08:00
Aviral Goel	d85f065b15	chore(copyright): update copyright header for example directory (#3273 ) * chore(copyright): update copyright header for codegen directory * chore(copyright): update copyright header for example directory	2025-11-24 18:02:41 -08:00
Thomas Ning	de6a9590ab	Reorganize of KPack in GEMM (#3247 ) * add the reorganize of KPack * fix the compilation error * fix the compilation error	2025-11-24 12:38:59 -08:00
Khushbu Agarwal	fb41a7b73b	fixing ambiguous shuffle definitions (#3175 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-11-12 23:44:12 -08:00
Thomas Ning	9f33b7cfd3	fix input range (#3188 )	2025-11-10 11:08:41 -08:00
Emily Martins	2ec57a8e70	Replace CK_TILE_PIPELINE macros with a common enum This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3, CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum called GemmPipeline to reduce code duplication.	2025-11-03 09:35:05 -07:00
Aviral Goel	73f637894d	refactor: remove gemm preshuffle pipeline v1 by removing all references from codebase (#3132 ) * test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf * refactor: deprecate gemm preshuffle pipeline v1 by removing all references from codebase * Revert "test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf" This reverts commit `573c08a085`.	2025-11-02 00:06:28 -04:00
Jimniu	8c4cb4f9f4	Jimniu/ ck tile gemm stride validation (#2710 ) * Add stride validation for gemm_basic * change default stride statement * Fix build error * Fix pre-commit failure * Addressed PR comments * clear the redundant code * clang format --------- Co-authored-by: mkumar16-amd <mkumar16@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-10-29 19:45:09 -07:00
Vidyasagar Ananthan	92c67a824f	[DOCS] Documentation Addition (Readme updates) (#2495 ) * GH-2368 Adding a basic glossary GH-2368 Minor edits GH-2368 Adding missing READMEs and standardization. resolving readme updates GH-2368 Minor improvements to documentation. Improving some readmes. Further improvement for readmes. Cleaned up the documentation in 'client_example' (#2468) Update for PR Update ACRONYMS.md to remove trivial terms Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine. revise 37_transpose readme revise 36_copy readme Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity. Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity. Remove references to the Tile Engine in README files across multiple examples Refine README files by removing outdated references to the Tile Engine * Updates based on PR feedback 1 * Updates based on PR feedback 2 * Updates based on PR feedback 3 * Updates based on PR feedback 4 * Updates based on PR feedback 5 * Updates based on PR feedback 6 * Updates based on PR feedback 7 * Updates based on PR feedback 8 * Content Modification of CK Tile Example * Modify the ck_tile gemm config --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-10-16 03:10:57 -07:00
aledudek	634634f5c0	[CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955 ) * First checkpoint * Second checkpoint - hot loop scheduler * Third checkpoint - init main operator * Fourth checkpoint - main loop ready * Fifth checkpoint - main loop fix * Sixth checkpoint - ReadWritecompFunc * Seventh checkpoint - Tail finished * [CK_TILE] Blockwise gemm pipeline v5 complete * Working * Working fixes 2 * Rename v5 to v77 temporarily * Data type adjustment * Data type adjustment 2 * [CK_TILE] Blockwise Gemm pipeline v5 add tests * [CK_TILE] Fix calculation error * TEMP: check pipeline * Fix name to V6 * naming and documentation changes * WIP dump * Try fixing v1 * Failing tests v5 * Debugging * Changes v2 * F16 tests working great * Working BlockwiseGemmPipelineV5 as V6 * Cleanup and format * Merging changes part1 * [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6 * Remove commented code * Fix gfx950 build issues * Fix file formatting * Review changes, more concat info, add bf16 bf8 tests * Fix formatting * Add bf16 and bf8 tests --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2025-10-13 13:57:37 +02:00
Khushbu Agarwal	3c39d279ab	supporting prefill shapes for preshuffle block scale gemm (#2975 ) * debugging * debugging for prefill shapes * comment unused code * fix for prefill shapes * clearing up the code * add int4 to universal gemm example * clang formatted * adding test for prefill shapes in block scale gemm * lil improv on the block pipeline * Address Review Comment --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-10-10 15:36:24 -07:00
Aviral Goel	19415d0b6f	fix: nil performance results for gemm examples (#2950 )	2025-10-06 12:43:23 -07:00
Hosang	28ad8ae5d8	Fix timing issue in CK_TILE GEMM example (#2940 )	2025-09-29 15:34:04 -07:00
lalala-sh	ee9769616a	fix wp gemm bug when permuteN is false (#2935 ) * fix wp gemm bug when permuteN is false * code clean --------- Co-authored-by: valarLip <340077269@qq.com>	2025-09-26 13:28:54 -07:00
Cong Ma	e5d73da2da	[CK TILE GEMM] set correct value to TiledMMAPermuteN_ (#2839 ) - TiledMMAPermuteN_ should be set to true when config if GemmConfigPreshufflePrefill	2025-09-13 20:54:08 -07:00
linqunAMD	b0ee317d83	[CK_TILE] Enable ck_tile tests on gfx11 and gfx12 (#2821 ) * [CK_TILE] Enable ck_tile test on gfx11 & gfx12 * revert an unnecessary change * enable pk_int4 on gfx11 & gfx12 * revert .pre-commit-config.yaml	2025-09-12 12:45:14 -07:00
linqunAMD	60d3e8f504	[CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 (#2808 ) * [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 * fix gemm_splitk_two_stage * revert .pre-commit-config.yaml	2025-09-11 07:27:33 -07:00
linqunAMD	0b9a638f26	[CK_TILE] fix example reduces, permute and elementwise on gfx11 & gfx12 (#2810 ) 1. Refine Reduce2dShape to support both wave32 and wave64 2. Fix example reduce, permute and elementwise on gfx11 and gfx12 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-09-11 12:41:20 +08:00
linqunAMD	df4ee556d6	[CK_TILE] Fix flatmm on gfx11 and gfx12 (#2790 ) 1. Correct shuffle_b and MakeBFlatDramTileDistribution according to WMMA warp layout 2. Add FlatmmConfig16_Wmma for gfx11 and gfx12	2025-09-10 08:28:00 +08:00
Cong Ma	82890192dd	[CK TILE] Support fp8/fp16 with pk_int4_t as data types for tensors A and B (#2805 ) - Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats - Implement A(bf8) B(i4) support in universal GEMM - Use new implementation for i4 to fp8 conversion in Block Scale	2025-09-09 16:40:52 -07:00
lalala-sh	75570d0fa8	[CK_TILE] Add permuteN optimization to remove lds operation in c_shuffle (#2764 ) * permuteN optimization to remove lds operation in c_shuffle * add the change log --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-09-08 22:02:48 -07:00
lalala-sh	e4a7728903	[CK_TILE] add atomic IGLP scheduler for wp gemm (#2739 ) * add atomic IGLP scheduler * clang format --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-09-08 14:57:14 -07:00
Illia Silin	ef6c28e989	Fix latest AITER failure and add more AITER tests in CK CI. (#2782 ) * add aiter tests and move json_dump header * remove example/include path from cmake * extend time for aiter and pytorch stages	2025-09-04 13:44:00 -07:00
SamiAario-AMD	1acd8e041c	[CK Tile] gemm splitk two stage (#2697 ) * Fix a typo * Use std::variant to call run_gemm_example_with_layouts with the available layout variant combinations * Use a unified run_gemm_example_prec_type for basic gemm and universal gemm * Factor out run_gemm_example_prec_type * Refactor argument parsing in gemm_splitk_two_stage_reduce.cpp * Parse arguments outside of create_args * Move the gemm operators to separate structs to facilitate their reuse * Move the invokers to separate files to facilitate their reuse * Rename the invoker files for consistency with the examples that use them * Add fp32 support to the elementwise examples, and produce an error message for unsupported types * Get rid of four unused variables * Make two variables const * Add support for different input-output type combinations in elementwise examples * Test support for different input and output types in elementwise examples * Add support for different operations in the elementwise unary tests * Add support for UnaryConvert in the elementwise unary tests * Add support for bf16 in elementwise examples, excluding unsupported type combinations * Make some operator parameters const in ElementWiseKernel * Remove some unnecessary include statements * Implement a two-stage GEMM that does a type conversion in the second stage using the elementwise kernel * Clear workspace instead of output when flushing the cache in SplitKTwoStageInvoker::gemm * Fix formatting issues reported by clang * Add back CK_TILE_USE_WMMA related changes * Use the right prec type for bf16 in the universal GEMM and two stage split K examples * Add some brackets * Add some brackets * Separate the clearing of the GEMM output memory from the cache flushing in the universal GEMM example * Separate the clearing of the GEMM output memory from the cache flushing in the split K two stage example * Fix formatting * No need to call SetZero on ws_m_n_dev_buf here, as clear_gemm_output now does this as part of the 
kernel preprocessing * Add fp16 data type to splitk two stage example * Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the basic GEMM example	2025-09-04 14:33:44 +03:00
rahjain-amd	4d041837ad	Add json dump support to output details from CK/CKTile Examples. (#2551 ) * Adding RapidJson Library * Adding Json Dumps in all CK_Tile Examples Not verified yet * Adding json to cktile Batched Transpose * adding json dumps to layernorm2d_fwd * Adding json dump to flatmm_basic * Adding RapidJson Library * Adding Json Dumps in all CK_Tile Examples Not verified yet * Adding json to cktile Batched Transpose * adding json dumps to layernorm2d_fwd * Adding json dump to flatmm_basic * Adding json in 03_gemm * Add json dump to 16_batched_gemm * Add json dump to gemm_multi_d_fp16 * Add json dump to grouped_gemm * fix fmha_bwd/fwd * Fix clang-format errors exclude include/rapidjson in jenkins as it's a third-party library * Separating function and definition. * Update Documentation of 03_gemm * Refactoring as per code review * Disable fp8 instances on unsupported targets (#2592) * Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt * Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt * Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt * Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt --------- Co-authored-by: AviralGoelAMD <aviralgoel@amd.com> * fix clang format * remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt * Fixing Readme and unifying jsondumps * adding moe_smoothquant * adding fused_moe * Fixing Readme for batched_gemm * Fixing Readme for grouped_gemm * adding flatmm * adding gemm_multi_d_fp16 * adding elementwise * adding File name when json is dumped * Fixing Reduce after merge * adding batched_transpose * Adding Warptile in Gemm * Fixing Clang Format --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: AviralGoelAMD <aviralgoel@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-09-02 23:31:29 -07:00
Aviral Goel	f5f795c4d6	feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691 ) * feat(check_err): add a variable to adjust number of incorrect values to print * feat(host_tensor): add printing capability for fp8 bf8 int8 int4 * fix(gemm_utils): update acceptable data type * fix(host_tensor): print both 4 bit ints in pk_int4_t * refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp * feat(host_tensor): add print first n elements functions	2025-08-27 18:17:24 -07:00
Tianyuan Wu	e9605ed36d	[CK_TILE] Fix the Wrong Output Generated by Gemm Examples on GFX11/12 (#2713 ) * Introduce macro CK_TILE_USE_WMMA Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Make CK_TILE_USE_WMMA global for all examples Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Remove CK_TILE_USE_WMMA from config.hpp Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>	2025-08-25 12:55:35 -07:00
Max Podkorytov	f38751fc2a	invoke script directly (#2687 )	2025-08-19 00:23:07 -07:00


123 Commits