composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 05:01:25 +00:00

Author	SHA1	Message	Date
Anton Gorenko	1edd250115	[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836 ) * Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout Add comments with dropout implementation details Fix performance regression of fwd+dropout * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox; * "scalarize" seed and offset, they may come either from kernel args or from device memory (presumably loaded with vector loads). These changes help the compiler to procude more optimal code and reduce register spilling. Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding Use code based on BlockDropout in BlockDropoutBwd Refactor BlockDropout (fwd) Implement BlockDropout (fwd) for WMMA Originally BlockDropout only supported 32x32 tiles (IsWG32 = true), this version supports 16x16 tiles. If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly to BlockDropoutBwd. Implement BlockDropoutBwd for WMMA Remove MakeRandValLds* functions unused in BlockDropoutBwd Remove unused Run overload from BlockDropoutBwd * Fix regression with philox seed and offset when they exceed 32-bit int __builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset are 64-bit so they get truncated. * Add F32 MFMA warp gemms * Support f32 in fwd FMHA * Implement transpose_vectors for 4-byte types (float) * Fix unexpected implicit f32->uint32 cast in buffer_store<4> __builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint). mbuf_t types in other buffer_store<> are changed for consistency. * Support F32 in bwd FMHA hdim = 256 is disabled for now because it uses too much memory on gfx90a * Support Headdim = 48 (divisible by 16) in fwd * Add fp32-specific receipts (800 and 801) * Tune fwd tiles * Tune bwd tiles * Use small tiles only for small seqlen_q * Fix after rebasing * Fix selection of a fallback tile based on bm0 The assumption that the largest bm0 == 128 is not always true for current fp32 tiles. * Remove constraints and adjust filtering for fp32 Custom constraints are no longer needed because now the smallest tile is selected automtically based on seqlen_q. Filters related to qr_async_trload disabled valid fp32 tiles. * Add fp32 tests * Make splitkv and appendkv compile for fp32 only There are no instances yet, but API still must compile when only fp32 is requested. * Remove unimportant f32 instances * Add test_ck_tile_fmha__fp32 to REGRESSION_TESTS Replace magic numbers with a constant, improve comments for dropout * Update changelog * Fix condition that dq_acc must be set to zero when mask is used The change was introduced in #2799 * Replace warp_uniform with recently added amd_wave_read_first_lane * Add hdim = 96 and 192 to fwd	2025-09-27 18:03:48 +05:00
Anton Gorenko	c6bfd97c2d	[CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934 ) * Fix validation of rotary embedding with time_kernel_ When rotary embedding is used, the appendkv kernel modifies the q tensor (multiple times when time_kernel_ is set). We need to reset the q buffer and rerun all kernels. * Fix synchronization issue in splitkv combine pipeline Different warps can read and then rewrite the same values of lse_acc_lds. Sometimes warps progress at different speeds, one warp can rewrite values that are still being read by another warp. Running the tests multiple times and, preferably, with multiple processes on the same GPU helps to trigger this issue: bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/KV"	2025-09-27 08:16:10 +05:00
Yi DING	32773fe5cb	[CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918 )	2025-09-26 16:42:59 +08:00
Jeff Huang	518d24e662	Add sequence padding and variable length support in fmha (#2932 ) * * [CK_TILE] Add sequence padding and variable length support in fmha (and v3) - Group Mode Padding: Introduces the `-s_qpad` argument to support physically padded layouts. Kernels now use padded start pointers (`seqstart_padded__ptr`) for memory addressing. - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens` arguments for efficient processing of variable-length sequences by passing cumulative effective lengths (`cu_seqlen__ptr`) to the kernel. - FMHA examples: Support padding and variable length both in group and batch mode. Dispatcher is updated as well (dispatch to kPadSeqLenK enabled pipeline). - New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that specifically validate/benchmark the new padding and variable-length functionalities in both group and batch modes. * [CK_TILE] Fix build error in fmha unit tests * [CK_TILE] add mqa, gqa to sequence padding unit tests * [CI_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI * [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel	2025-09-26 12:36:27 +08:00
Khushbu Agarwal	b56e5d1d79	Fix for Add the API to load SGPR (#2913 ) * Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)" This reverts commit `f161b5b738`. * Fix: sgpr minor issue * cyclic dependency resolved * clang formatted * removing unused variable * clang formatted --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-09-25 10:32:42 -07:00
ltqin	ab22f91a7c	fix fmha fwd kernel name (#2880 ) * fix fmha fwd kernel name * if the input and output types are the same, keep the original code	2025-09-24 20:00:10 -07:00
Yi DING	fe0a47a011	[CK_TILE] FMHA BWD Add D96 Instances (#2916 )	2025-09-24 17:04:23 +08:00
asleepzzz	f161b5b738	Revert "[CK-Tile] Add the API to load SGPR (#2878 )" (#2904 ) This reverts commit `2cbbf5dcb3`.	2025-09-23 14:33:51 -07:00
Haocong WANG	959df2a155	[FMHA FWD] gfx950 Accuracy enhancement & bug fix (#2900 ) * disable cast_tile_pk_fp16_fp32 on gfx950 * fix wrong encoding when hdim is not exponentiation of 2 --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-09-24 00:59:41 +08:00
Haocong WANG	7b16782d7c	[CK_TILE] Fix fmha bwd (#2865 ) * Fix fmha bwd filter * remove unnecessary change * enable test cases --------- Co-authored-by: Yi DING <yi.ding@amd.com>	2025-09-23 19:59:27 +08:00
Thomas Ning	2cbbf5dcb3	[CK-Tile] Add the API to load SGPR (#2878 ) * Have a workable version for SGPR * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * substitute with the new sgpr read api * update the CHANGELOG * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * change to static for logic * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.	2025-09-23 01:23:56 -07:00
Haocong WANG	b6e8994386	[CK_TILE] FMHA FWD bug fix (#2888 ) * tempsave debug * fix the bug in fmha fwd_kernel * Remove unnecessary changes * Fix the buggy part * remove fmha fwd known failure cases	2025-09-23 15:00:46 +08:00
Illia Silin	b765fe78f3	Revert "[CK_TILE] Add sequence padding and variable length support in fmha (a…" (#2883 ) This reverts commit `86dd59cd01`.	2025-09-19 08:15:02 -07:00
Yi DING	6cf3fdd21c	[CK_TILE] FMHA BWD Fix Decode Accuracy (#2881 ) * [CK_TILE] FMHA BWD Fix Decode Accuracy * use s_waitcnt utils	2025-09-19 21:45:02 +08:00
Jeff Huang	86dd59cd01	[CK_TILE] Add sequence padding and variable length support in fmha (a… (#2851 ) * [CK_TILE] Add sequence padding and variable length support in fmha (and v3) - Group Mode Padding: Introduces the `-s_qpad` argument to support physically padded layouts. Kernels now use padded start pointers (`seqstart_padded__ptr`) for memory addressing. - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens` arguments for efficient processing of variable-length sequences by passing cumulative effective lengths (`cu_seqlen__ptr`) to the kernel. - FMHA examples: Support padding and variable length both in group and batch mode. Dispatcher is updated as well (dispatch to kPadSeqLenK enabled pipeline). - New padding test cases: Add padding test cases to `smoke_test_fwd.sh`, and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that specifically validate/benchmark the new padding and variable-length functionalities in both group and batch modes. * [CK_TILE] Fix build error in fmha unit tests --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com>	2025-09-19 17:36:49 +08:00
Anton Gorenko	2aec38f9ec	[CK_TILE] FMHA Fix synchronization issues in BWD pipelines (#2876 ) * Run ctest with --output-on-failure * Fix synchronization issues in bwd pipelines The bwd kernel reuses the same area of LDS for ds (SGrad), bias and dbias (BiasGrad). This means that there must be block_sync_lds between loading one tensor and storing another to the same area. Heavy instructions like MFMA/WMMA and global loads are executed between reuses of the same memory so in MOST cases loading is finished by all warps before storing is started. However, sometimes warps progress at different speeds. Running the tests multiple times and, preferably, with multiple processes on the same GPU helps to trigger this issue: bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure	2025-09-19 11:34:45 +05:00
ltqin	dd249f1cd6	Add input fp8 and output bf16 attention (#2726 ) * change host using fp16 to check * fp8 to fp8 compare * rewrite input parameters * add not squant * remove some output code * for scale = 1 * format * saturates only for fp8 * add fp8bf16 data type * add fp8bf16 data type * fix test fp8 code * add run_fp8bf16_tests * change fmha fwd example parameter(adding fp8bf16) * Support fp8bf16 for Aiter * Support aiter fp8bf16 in c++ * fix comment about fp8 in readme.md * add fp8fp32 * add fp8fp32 test * remove range_q etc. * format * fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type * add fp8bf16 to data_type function * change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8 * format * reset atol for fp8 * fix bug for atol --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-09-19 14:26:43 +08:00
Haocong WANG	59cb906482	[CK_TILE] fix bug when iperm =0 in fmha fwd (#2820 ) * fix bug when iperm =0 in fmha fwd * Disable f8 fmha smoke test until fix pr merged --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-09-16 15:07:10 +08:00
Po Yen Chen	7fbc9d6c97	[CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833 ) * Re-mapping thread block indices for causal=True kernels * Use more intuitive remap_opt value * Fallback to origin remapping if seqlen_q >= 64K * Use GenericAttentionMask to reduce mask computation * Avoid unnecessary boundary check for IsMasking=false case * Fix wrong kernel entry specifier * Add s_nop to prevent delay wave0-3 * Refine scheduling * Remove unnecessary sched_group_barrier() * Move sched_group_barrier() call to scheduler * Replace inline asm s_setprio with intrinsics * Rephrase comments * Expend some o_acc rescaling insts to avoid SIMD idle * Fix block idx special mapping logic * Tune block index mapping for causal=False cases * Tune block index mapping for causal=True cases * Fix wrong vmcnt() * Remove parameter name * Use boolean option for turn on/off causal mask * Update benchmark_fwd_v3.sh option usages * Add option if compiler support it	2025-09-16 11:32:38 +08:00
Yi DING	91178b4011	[CK_TILE] Fix kname & typo in FMHA BWD (#2809 )	2025-09-09 15:08:00 -07:00
Yi DING	5ff205ca79	Reapply "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 )" (#2757 ) (#2761 ) This reverts commit `038ea82315`.	2025-09-08 09:21:14 -07:00
Thomas Ning	42a43d1523	[FIX] fix on fmha_bwd (#2784 ) * fix on fmha_bwd * Add 'const' to the Default2DEpilogue call operator * Fix more calls to Default2DEpilogue --------- Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com>	2025-09-08 14:31:27 +08:00
Po Yen Chen	9f35cde374	[CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765 ) * Fix Default2DEpilogue usage * Fix Default2DEpilogue usage for batch_prefill	2025-09-02 09:51:56 -07:00
Haocong WANG	33418b201f	Fix naming issue (#2762 )	2025-09-02 11:18:53 +08:00
Po Yen Chen	d876e87fe4	[CK_TILE] Add FAv3 fwd pipeline (#2731 ) * Add FAv3 fwd pipeline * Unpack v_pk_mul to hide v_mov * Avoid compiler moving l compute across phase * Sync sched_group_barrier() setting for masking cases	2025-09-01 09:16:45 +08:00
Mateusz Ozga	0758883fa4	[CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752 ) * Init commit * Quick fix, CI fails * Remove CDElementWise * Add CDEELementWise --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-28 12:45:50 -07:00
asleepzzz	038ea82315	Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 )" (#2757 ) This reverts commit `ead4447b20`.	2025-08-28 22:50:42 +08:00
Yi DING	ead4447b20	[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 ) * 16x192 * Use buffer_load_lds for lse/d * Dispatch & cleanup * Avoid zeroing dq & fix * fix	2025-08-28 18:54:18 +08:00
Linjun-AMD	bf7b458e6e	use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711 ) * add k_lds padding and iglp to improve dim256 fmha fwd * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update block_fmha_pipeline_qr_ks_vs.hpp Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp * clang format Signed-off-by: JL-underdog <Jun.Lin@amd.com> * use same naming style --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-28 11:39:39 +08:00
Yi DING	de61e55493	[CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715 ) * FMHA avoid unnecessary vmcnt0 Squashed commit of the following: commit `7bdf6a7eef` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:51 2025 +0000 merge develop and solve conflicts commit `f21e916a8c` Merge: `a7dd2a7d1` `0db21053e` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:21 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue commit `a7dd2a7d13` Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 02:17:43 2025 +0000 update bwd commit `380aa8f311` Author: Kevin Choi <kevin.choi@amd.com> Date: Mon Aug 18 19:36:38 2025 +0000 add restrict to applicable functions commit `b85daba2a3` Author: Ding, Yi <yi.ding@amd.com> Date: Mon Aug 18 02:07:03 2025 +0000 bwd filter commit `75c4b9372f` Author: Kevin Choi <kevin.choi@amd.com> Date: Sat Aug 16 08:15:23 2025 +0000 remove noinline attr as it causes a lot more s_waitcnt's commit `598e3fec41` Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 12:11:17 2025 +0000 remove innerloop, move restrict parameters to mainloop and add noinline attribute. commit `3340408537` Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 07:06:51 2025 +0000 Create inner lambda with restrict parameters, add restrict to some parameters commit `3bc45ecbc7` Author: aska-0096 <haocwang@amd.com> Date: Thu Aug 14 03:43:54 2025 +0000 save for debug commit `de4db6c4c5` Merge: `108abf00e` `68694cb78` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:15:22 2025 +0000 Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `108abf00e0` Merge: `0810799e2` `0f42a92fc` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:14:26 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `68694cb781` Merge: `0810799e2` `20288caa2` Author: asleepzzz <hanwen.chang@amd.com> Date: Wed Aug 13 00:34:11 2025 +0800 Merge branch 'develop' into wip-async-tr-fa commit `0810799e25` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 14:25:50 2025 +0000 refactor blockgemm change, isolate to v2; commit `fd1eb323af` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:26:13 2025 +0000 clang format commit `75f6f6bac4` Merge: `bcc05eee6` `8e1eb0c1e` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:04:41 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `bcc05eee62` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 08:46:06 2025 +0000 Fix the bug commit `96d24497f5` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 04:02:41 2025 +0000 fix conflict. disable all v-col instance for fmha fwd commit `1716171be4` Merge: `1c9800790` `4fde1646e` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 03:52:34 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `1c98007901` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:53:31 2025 +0000 clang format commit `f43e903b1d` Merge: `3868ddd70` `a7badc6ec` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:52:52 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `3868ddd708` Merge: `498d234ab` `191c62967` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:59:40 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `498d234ab8` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:37:37 2025 +0000 change the warp setting for hdim32 fmha fwd commit `b86f7786e2` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 14:21:09 2025 +0000 tempsave, update the blocksync functions commit `7b8052d7ca` Author: aska-0096 <haocwang@amd.com> Date: Sun Aug 10 06:00:51 2025 +0000 fix bug in pki4 commit `76cbbb84a2` Author: aska-0096 <haocwang@amd.com> Date: Sat Aug 9 03:25:12 2025 +0000 fix bugs in gemm commit `8c101ccb88` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 18:35:53 2025 +0000 fix bug on non-gfx950 commit `efb8549279` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 17:53:19 2025 +0000 fix bug commit `729e8785fb` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 15:42:15 2025 +0000 fix bugs commit `250dc13c75` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:31:01 2025 +0000 fix clangformat with 18.1.3 commit `106edeecd9` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:07:40 2025 +0000 remove non-necessary change commit `78edd7303b` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:04:02 2025 +0000 bug fix, clang format; commit `3b9fb6af38` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 08:08:03 2025 +0000 Remove unnecessary changes commit `6bb57c2c57` Merge: `1ecee378d` `ab2602683` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 07:50:12 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `1ecee378d5` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 06:19:31 2025 +0000 remove unnecessary files; rename some files commit `b4640a9de6` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 05:46:18 2025 +0000 merge fa_decode pipeline into fmha_fwd api commit `fe63a646a4` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 6 05:58:43 2025 +0000 add __restrict__ to tr load commit `414cad667b` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 5 07:23:51 2025 +0000 Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug commit `0d12fc944f` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:27:42 2025 +0000 Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA commit `4f31847de1` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:02:17 2025 +0000 add vmcnt guard before load ktile commit `746f4ccb99` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 06:49:01 2025 +0000 Load Q through lds, implement xor; commit `2d4e73d2b4` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 1 10:44:54 2025 +0000 small refactor commit `a28b6e67fe` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 10:25:37 2025 +0000 upgrade prefill pipeline; simple iglp; consistent data produce and consume order commit `75cba48682` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 05:13:27 2025 +0000 enable larger tile size; upgrade xor pattern commit `69890afc98` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 12:25:33 2025 +0000 remove all lds bankconflict with xor layouts commit `8dacc35c4c` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 03:51:06 2025 +0000 enable prefill overload operator(). commit `13bcc913de` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 25 07:10:01 2025 +0000 fix the lds alignment caused performance regression commit `af28123cec` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 23 09:05:57 2025 +0000 remove unnecessary features commit `14e0ab70c6` Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 22 08:04:05 2025 +0000 tempsave. asynccopy+trload sanity checked commit `1b468bac0b` Author: aska-0096 <haocwang@amd.com> Date: Mon Jul 21 05:55:55 2025 +0000 tempsave, trload+asyncload done commit `afd96d8180` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 10:04:34 2025 +0000 compile pass commit `5616551115` Merge: `ae39c84f5` `095393276` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:17:27 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `ae39c84f55` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:16:39 2025 +0000 tempsave commit `94b6430489` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 10:06:09 2025 +0000 temp save commit `7e330553dc` Merge: `18669925c` `804f77dce` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 07:24:32 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline commit `804f77dce5` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 03:10:46 2025 +0000 move test_copy into test commit `21627d7ca7` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:41:31 2025 +0000 remove unnecessary output commit `287792c44a` Merge: `a4221db30` `21fd7e953` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:13 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix commit `a4221db304` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:10 2025 +0000 add input validation and bug fix commit `21fd7e9538` Merge: `d6df7bf85` `6e76b8205` Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Date: Wed Jul 16 11:23:57 2025 -0700 Merge branch 'develop' into test_copy_fix commit `d6df7bf851` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:55:50 2025 +0000 fix vmcnt shift commit `40e039e4e4` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:37:07 2025 +0000 Improve s_waitcnt_imm calculation commit `c30f8b709b` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 05:39:50 2025 +0000 fix the s_waitcnt_imm calculation commit `ec0a45b29f` Merge: `e5cc4af80` `6b09f0823` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:57:57 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix commit `e5cc4af808` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:54:33 2025 +0000 Add block_sync_lds_direct_load utility commit `eea58629cf` Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 15 09:39:03 2025 +0000 fix async copytest bug commit `18669925cc` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 10 04:29:33 2025 +0000 temp save, change all instance to 1wave commit `18686cfe5b` Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 8 08:37:20 2025 +0000 tempsave, fmha_decode commit `47565f21a5` Author: aska-0096 <haocwang@amd.com> Date: Sat Jun 21 15:02:57 2025 +0000 temp save, waiting for debug commit `e0a634ef97` Author: aska-0096 <haocwang@amd.com> Date: Thu Jun 19 05:11:52 2025 +0000 save an example for __bf16 type commit `4bd5fd4a3c` Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 07:27:24 2025 +0000 fix bwd code commit `69809d9513` Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 06:37:16 2025 +0000 Fix for fwd/bwd kernel build filter commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 08:13:18 2025 +0000 Add restrict to avoid unnecessary vmcnt --------- Co-authored-by: aska-0096 <haocwang@amd.com> * Add comments for c-stype cast * Better comments --------- Co-authored-by: aska-0096 <haocwang@amd.com>	2025-08-25 20:55:12 +08:00
John Shumway	c71d7ddd74	Remove unsupported use of c++20 concept. (#2719 ) Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.	2025-08-24 21:29:23 -07:00
Po Yen Chen	4a7ecce096	[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549 ) * Support async load dwordx4 * Enlarge load size on gfx950	2025-08-22 10:13:47 +08:00
Yi DING	4cfa2c7158	[CK_TILE] FMHA BWD Fix Compilation with Bias (#2682 ) * [CK_TILE] FMHA BWD Fix Compilation with Bias * Fix appendkv kApplyRoPE	2025-08-22 10:01:10 +08:00
linqunAMD	9fcc1ee9fd	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-18 10:08:31 -07:00
Tianyuan Wu	68134b60e4	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit `07a79e797d`. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit `ceee918007`. * Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit `7a7241085e`. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-08-15 16:22:27 -07:00
SamiAario-AMD	28a97865f5	Cleanups (#2631 ) * Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp * Simplify two templated operator calls by having the templated types deduced automatically * Simplify two GemmPipeline calls * Fix GemmPipelineAgBgCrCompV4::GetName * Refactor use of ArgParser in CK tile GEMM examples * Update args in README.md to match the implementation in create_args * Remove some unnecessary include statements * Rename two variables * Factor out common code * Factor out do_verify * Add and use type aliases for memory operation integral constants * In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-08-13 10:12:08 +02:00
Haocong WANG	05a6e92705	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-08-13 14:57:43 +08:00
asleepzzz	5b39de4bb6	Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 ) This reverts commit `b7322a521a`.	2025-08-12 20:27:10 +08:00
Haocong WANG	b7322a521a	Optimize fmha fwd decode & prefill for gfx950 (#2641 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-08-12 19:43:14 +08:00
Yi DING	8e1eb0c1ee	[CK_TILE] FMHA BWD Decode Pipeline (#2643 ) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2	2025-08-12 17:02:52 +08:00
Yi DING	4fde1646e5	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628 ) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp	2025-08-12 11:11:55 +08:00
Yi DING	b0a97498b0	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad	2025-08-07 21:24:43 +08:00
Illia Silin	49723e94bb	fix the clang-format (#2578 )	2025-07-28 20:49:55 -07:00
Yi DING	1926cd0cb8	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130 ) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-29 09:31:14 +08:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
shay-li77	8ae528a1b4	fix mha bwd dbias random mismatch (#2570 ) * fix mha bwd dbias random mismatch * formatting code	2025-07-28 14:39:31 +08:00
liang	d2459878cf	reorder grid dim schedule (#2533 ) Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-26 02:46:55 +08:00
Po Yen Chen	722c22fb15	Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389 )" (#2514 ) This reverts commit `b2dea90116`.	2025-07-17 10:09:01 +08:00
Qianfeng	45904b8fd7	Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487 ) * Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline * i_nhead_ conversion type to prevent overflow --------- Co-authored-by: ltqin <letaoqin@amd.com>	2025-07-11 18:14:47 +08:00
shay-li77	d814fefe18	support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338 ) * mask support ratio for y axis * format code * add notes for param y_ratio * fix comments error * support template and mdiv for ratio mask * refactor y-ratio mask constructor * optimize coordinate calculation * add SimplifiedRatioAttentionMask	2025-07-09 23:18:55 +08:00

1 2 3

125 Commits