composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
linqunAMD	eaf1fa7edb	[CK_TILE] fix example reduces, permute and elementwise on gfx11 & gfx12 (#2810 ) 1. Refine Reduce2dShape to support both wave32 and wave64 2. Fix example reduce, permute and elementwise on gfx11 and gfx12 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `0b9a638f26`]	2025-09-11 12:41:20 +08:00
Khushbu Agarwal	2eb6cbb6a8	[CK-Tile] Fix quant example code (#2813 ) * initial commit * remove extra files * fixing errors * updated ReadMe file for mapping of diff quants with diff configs * addressing review comments * addressing review comments * Resolved merge conflicts * [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled The get_preshuffle_or was not working as expected, which led to incorrect behavior in the quantization preshuffle process. This change replaces it with the more reliable is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied. --------- Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `80a61afb9b`]	2025-09-10 17:15:39 -07:00
linqunAMD	5b17f135b7	[CK_TILE] Refine Generic2dBlockShape to fix ck_tile example 2,10,11,14 on rdna3 and 4 (#2795 ) BlockWarps, WarpTile in Generic2dBlockShape are wave size dependent, it causes mangled name mismatch between host and device side. Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape [ROCm/composable_kernel commit: `c254f3d7b4`]	2025-09-10 08:29:20 +08:00
linqunAMD	7022f756e6	[CK_TILE] Fix flatmm on gfx11 and gfx12 (#2790 ) 1. Correct shuffle_b and MakeBFlatDramTileDistribution according to WMMA warp layout 2. Add FlatmmConfig16_Wmma for gfx11 and gfx12 [ROCm/composable_kernel commit: `df4ee556d6`]	2025-09-10 08:28:00 +08:00
Cong Ma	f7ffd111ee	[CK TILE] Support fp8/fp16 with pk_int4_t as data types for tensors A and B (#2805 ) - Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats - Implement A(bf8) B(i4) support in universal GEMM - Use new implementation for i4 to fp8 conversion in Block Scale [ROCm/composable_kernel commit: `82890192dd`]	2025-09-09 16:40:52 -07:00
Yi DING	22490acf0b	[CK_TILE] Fix kname & typo in FMHA BWD (#2809 ) [ROCm/composable_kernel commit: `91178b4011`]	2025-09-09 15:08:00 -07:00
lalala-sh	f8c8263798	[CK_TILE] Add permuteN optimization to remove lds operation in c_shuffle (#2764 ) * permuteN optimization to remove lds operation in c_shuffle * add the change log --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `75570d0fa8`]	2025-09-08 22:02:48 -07:00
lalala-sh	b2f2800468	[CK_TILE] add atomic IGLP scheduler for wp gemm (#2739 ) * add atomic IGLP scheduler * clang format --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `e4a7728903`]	2025-09-08 14:57:14 -07:00
kyle-256	61e2fe063e	[CK_TILE] Implement Row/Col quant grouped gemm (#2786 ) * Add cshuffle epilogue test * add the poc implementation to the epilogue and tests * refactor cshuffle epilogue * WIP: adding tensor/tile usage to scale_tile * fix usage of tile_elementwise_inout * add gemm_quant_kernel for generalizing gemm quant kernel * Add problem specific to different quants, add QuantType to Traits * Add quant_type to quant_kernel template parameters * Create aq/bq_block_windows and views depending on QuantType * Use tile windows as inputs in cshuffle epilogue * Fix some issues in epilogue * initial new example code for new general gemm quant kernel test * Fix issues in kernel * Add verification check for rowcol Quantmode * use AccDataType instead of AQ in pipeline * fix aquant preshuffle * fix formatting * some cleanup * remove gemm_aquant_basic.cpp * remove gemm_aquant_kernel.hpp * fix tests for the renamed quant kernel * fix formatting * clean example files * fix some merge conflicts * fix preshufflequant rename issue * updating * fix some templates after merging with develop * fix test preshuffle parameter * fix formatting * updating kernels * change update user * test username * update quant_grouped_gemm example * update example * Unify bquant kernel to the common quant kernel * remove bquant kernel also from common header * fix formatting * clean up commented code * update grouped_gemm_quant example * fix formatting config hpp * fix merge mistake * Non-const for movable windows * fix formatting * update tileloop pipleline * Fix grammar in README Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Remove #include<bit> and clean up example * fix strides * Add some descriptions for move_windows * fix tensor print bug * update quant_grouped_gemm example * remove useless code * cleanup code * clean up code & format code * fix compile & running bug in grouped_gemm example --------- Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: liyingli <liyingli@amd.com> Co-authored-by: kyle-256 <Kyle.Zhao@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `4eb415829e`]	2025-09-08 10:25:57 -07:00
Yi DING	cb02189b6e	Reapply "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 )" (#2757 ) (#2761 ) This reverts commit `3db57fc348`. [ROCm/composable_kernel commit: `5ff205ca79`]	2025-09-08 09:21:14 -07:00
Thomas Ning	8e30f972f6	[FIX] fix on fmha_bwd (#2784 ) * fix on fmha_bwd * Add 'const' to the Default2DEpilogue call operator * Fix more calls to Default2DEpilogue --------- Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `42a43d1523`]	2025-09-08 14:31:27 +08:00
Aviral Goel	70f4b54dfd	feat(grouped_gemm): add preshuffle v2 support to grouped gemm example (#2721 ) * docs(README): update readme with new build instructions * feat(grouped_gemm): add support back for non persistent kernel * refactor(grouped_gemm): simplify tensor creation * refactor(grouped_gemm): Persistance is now GemmConfig value for easier management * chore(grouped_gemm): add print statements to ease debugging * WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration * fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity * WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines * chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle * style: clang format work so far * BUG!(grouped_gemm_kernel.hpp): figured out a potential bug in for numerical errors in preshuffle pipeline * fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors * refactor(gemm_presuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm * chore(grouped_gemm): add/remove debug comments and debug print statements * feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes * chore(gemm_profile): add new argument combinations * fix: branch cleanup, formatting, refactoring * fix: branch cleanup, formatting, refactoring * chore(changelog): update changelog to reflect new featuer * address review comments & nit [ROCm/composable_kernel commit: `e279e9420e`]	2025-09-07 14:18:35 -07:00
Ville Pietilä	8b4be3a0ed	[CK Tile] Fix building grouped conv examples in CK Tile (#2777 ) * Fix compilation of the grouped conv examples. * Fix grouped conv bwd weight example output in CK Tile. [ROCm/composable_kernel commit: `83f607e2a6`]	2025-09-05 09:14:21 +03:00
Sami Remes	abf4f7a7b2	[CK_TILE] Row/Col quant gemm (#2729 ) * Add cshuffle epilogue test * add the poc implementation to the epilogue and tests * refactor cshuffle epilogue * WIP: adding tensor/tile usage to scale_tile * fix usage of tile_elementwise_inout * add gemm_quant_kernel for generalizing gemm quant kernel * Add problem specific to different quants, add QuantType to Traits * Add quant_type to quant_kernel template parameters * Create aq/bq_block_windows and views depending on QuantType * Use tile windows as inputs in cshuffle epilogue * Fix some issues in epilogue * initial new example code for new general gemm quant kernel test * Fix issues in kernel * Add verification check for rowcol Quantmode * use AccDataType instead of AQ in pipeline * fix aquant preshuffle * fix formatting * some cleanup * remove gemm_aquant_basic.cpp * remove gemm_aquant_kernel.hpp * fix tests for the renamed quant kernel * fix formatting * clean example files * fix some merge conflicts * fix preshufflequant rename issue * fix some templates after merging with develop * fix test preshuffle parameter * fix formatting * Unify bquant kernel to the common quant kernel * remove bquant kernel also from common header * fix formatting * clean up commented code * fix formatting config hpp * fix merge mistake * Non-const for movable windows * fix formatting * Fix grammar in README Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Remove #include<bit> and clean up example * fix strides * Add some descriptions for move_windows --------- Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `c6010f2953`]	2025-09-04 16:17:12 -07:00
SamiAario-AMD	e4ac6bca96	[CK Tile] gemm splitk two stage (#2697 ) * Fix a typo * Use std::variant to call run_gemm_example_with_layouts with the available layout variant combinations * Use a unified run_gemm_example_prec_type for basic gemm and universal gemm * Factor out run_gemm_example_prec_type * Refactor argument parsing in gemm_splitk_two_stage_reduce.cpp * Parse arguments outside of create_args * Move the gemm operators to separate structs to facilitate their reuse * Move the invokers to separate files to facilitate their reuse * Rename the invoker files for consistency with the examples that use them * Add fp32 support to the elementwise examples, and produce an error message for unsupported types * Get rid of four unused variables * Make two variables const * Add support for different input-output type combinations in elementwise examples * Test support for different input and output types in elementwise examples * Add support for different operations in the elementwise unary tests * Add support for UnaryConvert in the elementwise unary tests * Add support for bf16 in elementwise examples, excluding unsupported type combinations * Make some operator parameters const in ElementWiseKernel * Remove some unnecessary include statements * Implement a two-stage GEMM that does a type conversion in the second stage using the elementwise kernel * Clear workspace instead of output when flushing the cache in SplitKTwoStageInvoker::gemm * Fix formatting issues reported by clang * Add back CK_TILE_USE_WMMA related changes * Use the right prec type for bf16 in the universal GEMM and two stage split K examples * Add some brackets * Add some brackets * Separate the clearing of the GEMM output memory from the cache flushing in the universal GEMM example * Separate the clearing of the GEMM output memory from the cache flushing in the split K two stage example * Fix formatting * No need to call SetZero on ws_m_n_dev_buf here, as clear_gemm_output now does this as part of the kernel preprocessing * Add fp16 data type to splitk two stage example * Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the basic GEMM example [ROCm/composable_kernel commit: `1acd8e041c`]	2025-09-04 14:33:44 +03:00
arai713	f7bd3efade	[CK TILE] Stream-K tile partitioner (#2708 ) * initial commit for skeleton code * replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code * fixed up code to match CK Tile convention: data type changes, naming changes, etc. * change for num_sk_blocks data type * formatting fix * minor fixes * moved reduction argument to template * resolved comments from PR review: standardizing naming, pruning unneeded code * resolve errors from merge of device op PR: moved enum to common file * switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems * unsigned type fix * add const qualifier * added documentation for template parameters * documentation edit [ROCm/composable_kernel commit: `0282d98412`]	2025-09-03 13:38:17 -07:00
msaffari-amd	c48a0b3a9b	refactor: use snake_case naming in ck_tile/core components (#2766 ) [ROCm/composable_kernel commit: `47d020a993`]	2025-09-03 09:34:11 +02:00
Cong Ma	f33470644b	[CK TILE GEMM] Fix building issues (#2772 ) - Add `WarpGemmMfma_f32_16x16x128_[fp8\|bf8]_[fp8\|bf8]_CTransposed` - Replace `__gfx950__` with `CK_GFX950_SUPPORT` [ROCm/composable_kernel commit: `e1ab460d2d`]	2025-09-02 22:40:18 -07:00
Po Yen Chen	793645c57c	[CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765 ) * Fix Default2DEpilogue usage * Fix Default2DEpilogue usage for batch_prefill [ROCm/composable_kernel commit: `9f35cde374`]	2025-09-02 09:51:56 -07:00
Michael Mcminn	6fe28f408c	Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGem… (#2751 ) * Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16 * Adding support for offload target gfx9-4-generic * This duplication here isn't ideal [ROCm/composable_kernel commit: `022f369deb`]	2025-09-02 10:35:07 +02:00
Haocong WANG	2133e2829a	Fix naming issue (#2762 ) [ROCm/composable_kernel commit: `33418b201f`]	2025-09-02 11:18:53 +08:00
Po Yen Chen	7fcd094b8c	[CK_TILE] Add FAv3 fwd pipeline (#2731 ) * Add FAv3 fwd pipeline * Unpack v_pk_mul to hide v_mov * Avoid compiler moving l compute across phase * Sync sched_group_barrier() setting for masking cases [ROCm/composable_kernel commit: `d876e87fe4`]	2025-09-01 09:16:45 +08:00
Aviral Goel	d148d716cd	chore(gemm): clang format to pass CI (#2758 ) [ROCm/composable_kernel commit: `fcff0043ae`]	2025-08-29 00:38:46 -07:00
Vijay Krish	8d7941a356	ck_tile kernel for gemm with groupwise quantized B tensor. (#2663 ) * This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. fp8, fp8 -> f32 bf8, bf8 -> f32 fp8, i4 -> f32 bf8, i4 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. * Solve merge conflict * [CK TILE] Update CHANGELOG.md --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `4208e28988`]	2025-08-28 23:43:02 -07:00
Cong Ma	37daef0e83	Support transposed C tile in Aquant (#2679 ) The performance of Aquant has increased after enabling transposed C. Do not need to exchange AQ elements among lanes after enabling transposed C as one thread only holds data from one row. [ROCm/composable_kernel commit: `428090f749`]	2025-08-28 13:28:09 -07:00
Mateusz Ozga	e4010d5ea1	[CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752 ) * Init commit * Quick fix, CI fails * Remove CDElementWise * Add CDEELementWise --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `0758883fa4`]	2025-08-28 12:45:50 -07:00
asleepzzz	3db57fc348	Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 )" (#2757 ) This reverts commit `f2e6edde3b`. [ROCm/composable_kernel commit: `038ea82315`]	2025-08-28 22:50:42 +08:00
linqunAMD	4fab7b0a30	[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728 ) * fix copy basic build error * fix other ck tile test build error [ROCm/composable_kernel commit: `4a49dac7c6`]	2025-08-28 20:30:30 +08:00
Yi DING	f2e6edde3b	[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 ) * 16x192 * Use buffer_load_lds for lse/d * Dispatch & cleanup * Avoid zeroing dq & fix * fix [ROCm/composable_kernel commit: `ead4447b20`]	2025-08-28 18:54:18 +08:00
Linjun-AMD	56395b127c	use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711 ) * add k_lds padding and iglp to improve dim256 fmha fwd * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update block_fmha_pipeline_qr_ks_vs.hpp Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp * clang format Signed-off-by: JL-underdog <Jun.Lin@amd.com> * use same naming style --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `bf7b458e6e`]	2025-08-28 11:39:39 +08:00
Cong Ma	871f68f227	[CK TILE GEMM] Fix a merge conflict (#2753 ) * Fixed a merge conflict in `39b254c1` * Foramt the code [ROCm/composable_kernel commit: `cd53e2e57e`]	2025-08-27 11:08:09 -07:00
Cong Ma	39b254c126	[CK TILE] Fix bugs in AQuant preshuffle (#2700 ) * [CK TILE] Fix bugs in AQuant preshuffle - Make Aquant works with block Mx64x256. `M` could be 16, 32, 64 - Make Aquant works with warp 16x16x32 and 32x32x16. * [CK TILE] Rename Preshuffle to PreshuffleQuant The new name, PreshuffleQuant, explicitly states the function's purpose: to preshuffle the quantization matrix. * [CK TILE Block Scale] Use GemmConfig to save tile properties - Remove specialization of GemmQuantTypeConfig - Pass GemmConfig around which contains tile properties. Stop using hard coded tile properties in `gemm_calc_aquant()` * [CK TILE Block Scale] Rename GemmConfig used in block scale - Remove unused GemmConfig - Rename GemmConfig used in block scale --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `245467f359`]	2025-08-27 00:05:54 -07:00
John Afaganis	cef79d5f82	Revert "[CK-TILE] Default epilogue, adding support for D (#2629 )" (#2746 ) This reverts commit `92037686ae`. [ROCm/composable_kernel commit: `508e7912f9`]	2025-08-26 09:48:49 -07:00
Mateusz Ozga	92037686ae	[CK-TILE] Default epilogue, adding support for D (#2629 ) * Extend 2d-epilogue, D support * Added tests & update * Remove unused attribute * Extend tests --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `d43228fbca`]	2025-08-25 19:29:35 -07:00
Yi DING	1f35e4c5b7	[CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715 ) * FMHA avoid unnecessary vmcnt0 Squashed commit of the following: commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:51 2025 +0000 merge develop and solve conflicts commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c Merge: 2dac61a4f `5d56dde0e` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:21 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140 Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 02:17:43 2025 +0000 update bwd commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682 Author: Kevin Choi <kevin.choi@amd.com> Date: Mon Aug 18 19:36:38 2025 +0000 add restrict to applicable functions commit 45534dee5bcbe532da46fc5cd6601cde10d84387 Author: Ding, Yi <yi.ding@amd.com> Date: Mon Aug 18 02:07:03 2025 +0000 bwd filter commit 7abd7b372b82cba94a457238b6b4a81d093e7280 Author: Kevin Choi <kevin.choi@amd.com> Date: Sat Aug 16 08:15:23 2025 +0000 remove noinline attr as it causes a lot more s_waitcnt's commit 89c29746a09255c1d26038171157e91d1b68d14a Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 12:11:17 2025 +0000 remove innerloop, move restrict parameters to mainloop and add noinline attribute. commit 6f61b3a5c80011411aa3aebf7983602f7c117566 Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 07:06:51 2025 +0000 Create inner lambda with restrict parameters, add restrict to some parameters commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a Author: aska-0096 <haocwang@amd.com> Date: Thu Aug 14 03:43:54 2025 +0000 save for debug commit 5f2c3cfa86c6951208a1cc227fa556704a885a88 Merge: 25f067b4f 165a2723c Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:15:22 2025 +0000 Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 25f067b4f09d6909a05e252c7621124046dfda57 Merge: 447c1c5d6 `bd3b4afb9` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:14:26 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 165a2723c557420b48891cc1ce3434e3675aef5d Merge: 447c1c5d6 `be39c00a8` Author: asleepzzz <hanwen.chang@amd.com> Date: Wed Aug 13 00:34:11 2025 +0800 Merge branch 'develop' into wip-async-tr-fa commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 14:25:50 2025 +0000 refactor blockgemm change, isolate to v2; commit 8f67083511ff77d31c880f4427d3bdf53a179568 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:26:13 2025 +0000 clang format commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965 Merge: c84662c34 `19ef22e56` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:04:41 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit c84662c345755ec5f3d524fdde4aa951c8f86298 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 08:46:06 2025 +0000 Fix the bug commit e0647ffa5646f8132529b152af02750c4010013d Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 04:02:41 2025 +0000 fix conflict. disable all v-col instance for fmha fwd commit 781f98236c376f57591a6d481cc2ee04b36a148b Merge: 241f3d7dc `8cb8da53c` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 03:52:34 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:53:31 2025 +0000 clang format commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f Merge: 1a629c098 `eda4a5e80` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:52:52 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 1a629c09876cc05f0750db7eade1d527dc32a1d3 Merge: f65874e5b `92c0435e2` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:59:40 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit f65874e5b07579d5b734b4c68877679a3ee04dac Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:37:37 2025 +0000 change the warp setting for hdim32 fmha fwd commit 7c5f5e65e97486c074ef9a138900ed9aafea547e Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 14:21:09 2025 +0000 tempsave, update the blocksync functions commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915 Author: aska-0096 <haocwang@amd.com> Date: Sun Aug 10 06:00:51 2025 +0000 fix bug in pki4 commit 073db2e18af21f1ed1fb3d1f1c15830838df986f Author: aska-0096 <haocwang@amd.com> Date: Sat Aug 9 03:25:12 2025 +0000 fix bugs in gemm commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 18:35:53 2025 +0000 fix bug on non-gfx950 commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 17:53:19 2025 +0000 fix bug commit 30de97f473685e0bd5b82f15eee2493d9a05cffd Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 15:42:15 2025 +0000 fix bugs commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:31:01 2025 +0000 fix clangformat with 18.1.3 commit e4cb185c41586d018771a5413efd909d8d53a8c5 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:07:40 2025 +0000 remove non-necessary change commit 498f0d44cfba17287cce8d10855cce5c5de263db Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:04:02 2025 +0000 bug fix, clang format; commit 3cb648cbc4883e6889340d85f48d803a21b9c805 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 08:08:03 2025 +0000 Remove unnecessary changes commit 9e7ff3b611b7933b65973907a0cae312a15d31c6 Merge: a3c1bfe6d `f4247d199` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 07:50:12 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 06:19:31 2025 +0000 remove unnecessary files; rename some files commit 6c257fa27729c005d539b5b71deeba3703031089 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 05:46:18 2025 +0000 merge fa_decode pipeline into fmha_fwd api commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449 Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 6 05:58:43 2025 +0000 add __restrict__ to tr load commit bbad2b979b701533b74f43452ffe0f775e019139 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 5 07:23:51 2025 +0000 Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug commit d7fabd5f765e2a573ddbaf0857ce6f691407e562 Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:27:42 2025 +0000 Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9 Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:02:17 2025 +0000 add vmcnt guard before load ktile commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 06:49:01 2025 +0000 Load Q through lds, implement xor; commit 62bb9f05177dfb8280d6c2be67a88492d6be4838 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 1 10:44:54 2025 +0000 small refactor commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 10:25:37 2025 +0000 upgrade prefill pipeline; simple iglp; consistent data produce and consume order commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 05:13:27 2025 +0000 enable larger tile size; upgrade xor pattern commit a468e59a01d6dd85c105ca30ac249491256c5915 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 12:25:33 2025 +0000 remove all lds bankconflict with xor layouts commit 39ff55cdc377311112100fb24bc013adfd8960c0 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 03:51:06 2025 +0000 enable prefill overload operator(). commit a7b152a788e8035c93f8e4cbf317863182665d8f Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 25 07:10:01 2025 +0000 fix the lds alignment caused performance regression commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 23 09:05:57 2025 +0000 remove unnecessary features commit 9758750801c7fd5a80f654eb982f43b87d674fa3 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 22 08:04:05 2025 +0000 tempsave. asynccopy+trload sanity checked commit 1c4c04d725047357224ebf8a2b94d9010a5651a6 Author: aska-0096 <haocwang@amd.com> Date: Mon Jul 21 05:55:55 2025 +0000 tempsave, trload+asyncload done commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 10:04:34 2025 +0000 compile pass commit d41b5eace939909084d32281710fb81142ad5fec Merge: 3f86a81ee `8c3766f0d` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:17:27 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 3f86a81eee75256a78df02032d50814aaa42b038 Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:16:39 2025 +0000 tempsave commit 7d43f7446a9a20773f70e08462393f6c9afb7280 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 10:06:09 2025 +0000 temp save commit 727629cd9115f1be9c1800bb65a8ea84ff06c250 Merge: aa5da19c9 94bceebc9 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 07:24:32 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline commit 94bceebc96ef4885e0ac861b7793e2e2897481bd Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 03:10:46 2025 +0000 move test_copy into test commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:41:31 2025 +0000 remove unnecessary output commit b1dbcacb1832560c6cc967a079dffce558228f0b Merge: 5b0d311e6 0eaf3325a Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:13 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix commit 5b0d311e649257557a7014c28fcfac0c327b77b5 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:10 2025 +0000 add input validation and bug fix commit 0eaf3325a8e019402ff12a2402f446f8471f584f Merge: a66e1d29a `f77d70498` Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Date: Wed Jul 16 11:23:57 2025 -0700 Merge branch 'develop' into test_copy_fix commit a66e1d29a8cccc17cc8958d970ec7b1281ec8291 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:55:50 2025 +0000 fix vmcnt shift commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:37:07 2025 +0000 Improve s_waitcnt_imm calculation commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 05:39:50 2025 +0000 fix the s_waitcnt_imm calculation commit 1c0870089a0e7c78ed71a278bf52d98fc780e482 Merge: d6ee05e36 `92ada43ba` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:57:57 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:54:33 2025 +0000 Add block_sync_lds_direct_load utility commit c037a72040217471f52ee76bed9c07bf5b22aef4 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 15 09:39:03 2025 +0000 fix async copytest bug commit aa5da19c94022449b027e7a57668f2e219f0f171 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 10 04:29:33 2025 +0000 temp save, change all instance to 1wave commit ddd172feb9eb2cb783420a8db6f44d51b350c370 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 8 08:37:20 2025 +0000 tempsave, fmha_decode commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33 Author: aska-0096 <haocwang@amd.com> Date: Sat Jun 21 15:02:57 2025 +0000 temp save, waiting for debug commit 71dd31f15bca01995c8cb0be9e903103f4657181 Author: aska-0096 <haocwang@amd.com> Date: Thu Jun 19 05:11:52 2025 +0000 save an example for __bf16 type commit cdf33e079fa7d7d5b03b06550df2356b02041d7b Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 07:27:24 2025 +0000 fix bwd code commit d630998dc6751f44097b1e9a239bb5063a793736 Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 06:37:16 2025 +0000 Fix for fwd/bwd kernel build filter commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 08:13:18 2025 +0000 Add restrict to avoid unnecessary vmcnt --------- Co-authored-by: aska-0096 <haocwang@amd.com> * Add comments for c-stype cast * Better comments --------- Co-authored-by: aska-0096 <haocwang@amd.com> [ROCm/composable_kernel commit: `de61e55493`]	2025-08-25 20:55:12 +08:00
John Shumway	1c8519a03b	Remove unsupported use of c++20 concept. (#2719 ) Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable. [ROCm/composable_kernel commit: `c71d7ddd74`]	2025-08-24 21:29:23 -07:00
Po Yen Chen	b87e256c42	[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549 ) * Support async load dwordx4 * Enlarge load size on gfx950 [ROCm/composable_kernel commit: `4a7ecce096`]	2025-08-22 10:13:47 +08:00
Yi DING	f885dd8b33	[CK_TILE] FMHA BWD Fix Compilation with Bias (#2682 ) * [CK_TILE] FMHA BWD Fix Compilation with Bias * Fix appendkv kApplyRoPE [ROCm/composable_kernel commit: `4cfa2c7158`]	2025-08-22 10:01:10 +08:00
Bartłomiej Kocot	092ba92f26	[CK Tile] Grouped convolution backward data (#2652 ) * base working version for single groupped conv bwd data * Fix 2d descriptor * fix groups * Add 3d support * fixes * fixes * fixes --------- Co-authored-by: Jakub Piasecki <jakpia21@gmail.com> [ROCm/composable_kernel commit: `4212bbc170`]	2025-08-20 05:29:57 -07:00
Emily Martins	aa964be163	[CK Tile] Stream K GEMM Kernel HostArgs and Kernel Classes (#2681 ) * CK Tile Stream K Device Ops Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The StreamKKernel class injects Universal Gemm and includes functions to facilitate kernel preparation for the GPU. * Stream K Device Ops Fixes - Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to ensure we get size needed for accumulation buffers and semaphores. - Pass in num_sk_blocks into TilePartitioner constructor - Update documentation * Add WarpTile dimensions to GetName function in StreamKKernel class * Fix typos in StreamKHostArgs class description. Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Apply clang format on updated comment for StreamKHostArgs * Explicitly specify type for StreamKReductionStrategy enum * Remove unecessary scopes * Unify the commenting style to inline comments * Add explicit casts for occupancy and num_cu in MakeKernelArgs function Both the static functions Occupancy and NumCU in the StreamKKernel class use functions from the HIP API that result in the returned occupancy and num_cu types being type int. The TilePartitioner interface for stream K will have occupancy and num_cu being type ck_tile::index_t which is int32_t. Thus, to be safe, this change ensures that both occupancy and num_cu are cast to int32_t. * Fix use of kentry due to interface update PR #2594 updated the interface for the kentry function in include/ck_tile/host/kernel_launch.hpp. As a result, the static function Occupancy was updated to work correctly with the new interface. PR #2594 also changed UniversalGemmKernel's KernelBlockSize static variable to kBlockSize, so the StreamKKernel class was updated to reflect this change. * Switch type of num_sk_blocks from uint32_t to int32_t This change switches the type of num_sk_blocks to type ck_tile::index_t which is int32_t. This was done because parallel work for the CK Tile StreamK TilePartitioner's constructor will have num_sk_blocks as ck_tile::index_t. Thus, this change will help unify the interfaces to avoid any type conversion errors. --------- Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> [ROCm/composable_kernel commit: `071165919f`]	2025-08-19 15:08:52 -06:00
joyeamd	f9e76244d8	fix grouped gemm example when wave32 enabled (#2707 ) 1, delete some unused variables 2, fix BlockSize when wave32 enabled [ROCm/composable_kernel commit: `a1589a9667`]	2025-08-19 16:20:43 +08:00
linqunAMD	615ca9842d	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `9fcc1ee9fd`]	2025-08-18 10:08:31 -07:00
Tianyuan Wu	a4d70b6e13	Fix CI build error (#2695 ) Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> [ROCm/composable_kernel commit: `7310830d14`]	2025-08-18 01:45:40 -07:00
Tianyuan Wu	abb90422b4	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5. * Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `68134b60e4`]	2025-08-15 16:22:27 -07:00
Yashvardhan Agarwal	b13aa4217e	CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) (#2632 ) * CK_TILE: Implement two-stage split-K GEMM with reduction - Added split-K GEMM with reduction example * comment resolutions [ROCm/composable_kernel commit: `7f14772406`]	2025-08-14 10:18:52 +02:00
Aviral Goel	e371bb31f5	Minor Improvements in CK TILE memory copy EXAMPLE (#2678 ) * Rename vector to ThreadTile * more notes on tile encoding * remove number<> from tuple of make_tile_window * add script to stress test the copy example [ROCm/composable_kernel commit: `8a698c7445`]	2025-08-13 15:24:16 -07:00
joyeamd	d6347da784	[CK_TILE]fix elementwise example in gfx11/12 (#2676 ) * fix elementwise examples * improve the robust * fix ck_tile's elementwise test * update elementwise test [ROCm/composable_kernel commit: `bcc38deff7`]	2025-08-13 15:21:46 -07:00
SamiAario-AMD	8a32077ccd	Cleanups (#2631 ) * Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp * Simplify two templated operator calls by having the templated types deduced automatically * Simplify two GemmPipeline calls * Fix GemmPipelineAgBgCrCompV4::GetName * Refactor use of ArgParser in CK tile GEMM examples * Update args in README.md to match the implementation in create_args * Remove some unnecessary include statements * Rename two variables * Factor out common code * Factor out do_verify * Add and use type aliases for memory operation integral constants * In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `28a97865f5`]	2025-08-13 10:12:08 +02:00
Haocong WANG	e3e8de3477	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `05a6e92705`]	2025-08-13 14:57:43 +08:00
Cong Ma	402b3f445b	Preshuffle AQ matrix in block scale gemm (#2624 ) * Preshuffle AQ matrix in block scale gemm * turns the output to fp16. Increase the repetition time. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `452791a3ba`]	2025-08-12 21:32:51 -07:00

1 2 3 4 5 ...

291 Commits