Khushbu Agarwal
bb5eeef2af
Fix for "Add the API to load SGPR" ( #2913 )
...
* Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878 )" (#2904 )"
This reverts commit 5cc40c160f .
* Fix: sgpr minor issue
* cyclic dependency resolved
* clang formatted
* removing unused variable
* clang formatted
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
[ROCm/composable_kernel commit: b56e5d1d79 ]
2025-09-25 10:32:42 -07:00
ltqin
24a8daf662
fix fmha fwd kernel name ( #2880 )
...
* fix fmha fwd kernel name
* if the input and output types are the same, keep the original code
[ROCm/composable_kernel commit: ab22f91a7c ]
2025-09-24 20:00:10 -07:00
Yi DING
02db6094b9
[CK_TILE] FMHA BWD Add D96 Instances ( #2916 )
...
[ROCm/composable_kernel commit: fe0a47a011 ]
2025-09-24 17:04:23 +08:00
asleepzzz
5cc40c160f
Revert "[CK-Tile] Add the API to load SGPR ( #2878 )" ( #2904 )
...
This reverts commit fb5e953a05 .
[ROCm/composable_kernel commit: f161b5b738 ]
2025-09-23 14:33:51 -07:00
Haocong WANG
add2107be0
[FMHA FWD] gfx950 Accuracy enhancement & bug fix ( #2900 )
...
* disable cast_tile_pk_fp16_fp32 on gfx950
* fix wrong encoding when hdim is not a power of 2
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 959df2a155 ]
2025-09-24 00:59:41 +08:00
Haocong WANG
0eede5af24
[CK_TILE] Fix fmha bwd ( #2865 )
...
* Fix fmha bwd filter
* remove unnecessary change
* enable test cases
---------
Co-authored-by: Yi DING <yi.ding@amd.com >
[ROCm/composable_kernel commit: 7b16782d7c ]
2025-09-23 19:59:27 +08:00
Thomas Ning
fb5e953a05
[CK-Tile] Add the API to load SGPR ( #2878 )
...
* Have a workable version for SGPR
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* substitute with the new sgpr read api
* update the CHANGELOG
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* change to static for logic
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
[ROCm/composable_kernel commit: 2cbbf5dcb3 ]
2025-09-23 01:23:56 -07:00
Haocong WANG
d85ca87d97
[CK_TILE] FMHA FWD bug fix ( #2888 )
...
* tempsave debug
* fix the bug in fmha fwd_kernel
* Remove unnecessary changes
* Fix the buggy part
* remove fmha fwd known failure cases
[ROCm/composable_kernel commit: b6e8994386 ]
2025-09-23 15:00:46 +08:00
Illia Silin
ee43f0f0be
Revert "[CK_TILE] Add sequence padding and variable length support in fmha (a…" ( #2883 )
...
This reverts commit 7ede589f4b .
[ROCm/composable_kernel commit: b765fe78f3 ]
2025-09-19 08:15:02 -07:00
Yi DING
240de6ee26
[CK_TILE] FMHA BWD Fix Decode Accuracy ( #2881 )
...
* [CK_TILE] FMHA BWD Fix Decode Accuracy
* use s_waitcnt utils
[ROCm/composable_kernel commit: 6cf3fdd21c ]
2025-09-19 21:45:02 +08:00
Jeff Huang
7ede589f4b
[CK_TILE] Add sequence padding and variable length support in fmha (a… ( #2851 )
...
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)
- Group Mode Padding: Introduces the `-s_qpad` argument to support
physically padded layouts. Kernels now use padded start pointers
(`seqstart_padded_*_ptr`) for memory addressing.
- Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
arguments for efficient processing of variable-length sequences by
passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.
- FMHA examples: Support padding and variable length in both
group and batch mode. The dispatcher is updated as well (it dispatches to
the kPadSeqLenK-enabled pipeline).
- New padding test cases: Add padding test cases to `smoke_test_fwd.sh`,
and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh`.
These test cases and benchmarks specifically validate and measure the
new padding and variable-length functionality in both group and batch modes.
* [CK_TILE] Fix build error in fmha unit tests
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
Co-authored-by: Yi DING <yi.ding@amd.com >
[ROCm/composable_kernel commit: 86dd59cd01 ]
2025-09-19 17:36:49 +08:00
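The cumulative-length idea in the #2851 message above can be sketched on the host side as follows. This is only an illustration under stated assumptions: it assumes the common convention that a cumulative array has batch + 1 entries with a leading zero, which the commit message does not confirm, and `cumulative_lengths` and `length_of` are hypothetical helper names, not CK APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a CK API): prefix sum with a leading zero.
// {l0, l1, l2} -> {0, l0, l0 + l1, l0 + l1 + l2}
inline std::vector<std::int32_t> cumulative_lengths(const std::vector<std::int32_t>& lens)
{
    std::vector<std::int32_t> cu(lens.size() + 1, 0);
    for(std::size_t i = 0; i < lens.size(); ++i)
    {
        assert(lens[i] >= 0); // sequence lengths must be non-negative
        cu[i + 1] = cu[i] + lens[i];
    }
    return cu;
}

// Hypothetical helper: recover the length of sequence `i` from a cumulative array.
inline std::int32_t length_of(const std::vector<std::int32_t>& cu, std::size_t i)
{
    assert(i + 1 < cu.size()); // `i` must index a valid sequence
    return cu[i + 1] - cu[i];
}
```

For example, three sequences with effective lengths {3, 5, 2} stored in padded slots of {4, 8, 4} give {0, 3, 8, 10} for the effective lengths and {0, 4, 12, 16} for the padded starts. Under the assumed convention, the first array has the shape one would expect behind `cu_seqlen_*_ptr` and the second behind `seqstart_padded_*_ptr`; the padded array drives memory addressing while the effective lengths bound the work per sequence.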
Anton Gorenko
a64deec3ba
[CK_TILE] FMHA Fix synchronization issues in BWD pipelines ( #2876 )
...
* Run ctest with --output-on-failure
* Fix synchronization issues in bwd pipelines
The bwd kernel reuses the same area of LDS for ds (SGrad), bias and
dbias (BiasGrad). This means that there must be block_sync_lds between
loading one tensor and storing another to the same area.
Heavy instructions like MFMA/WMMA and global loads are executed between
reuses of the same memory so in MOST cases loading is finished by all
warps before storing is started. However, sometimes warps progress at
different speeds.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:
bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure
[ROCm/composable_kernel commit: 2aec38f9ec ]
2025-09-19 11:34:45 +05:00
ltqin
fd80c78f50
Add input fp8 and output bf16 attention ( #2726 )
...
* change host using fp16 to check
* fp8 to fp8 compare
* rewrite input parameters
* add not squant
* remove some output code
* for scale = 1
* format
* saturates only for fp8
* add fp8bf16 data type
* add fp8bf16 data type
* fix test fp8 code
* add run_fp8bf16_tests
* change fmha fwd example parameter(adding fp8bf16)
* Support fp8bf16 for Aiter
* Support aiter fp8bf16 in c++
* fix comment about fp8 in readme.md
* add fp8fp32
* add fp8fp32 test
* remove range_q etc.
* format
* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type
* add fp8bf16 to data_type function
* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8
* format
* reset atol for fp8
* fix bug for atol
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: dd249f1cd6 ]
2025-09-19 14:26:43 +08:00
Haocong WANG
4db9e47cd5
[CK_TILE] fix bug when iperm =0 in fmha fwd ( #2820 )
...
* fix bug when iperm =0 in fmha fwd
* Disable f8 fmha smoke test until the fix PR is merged
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: 59cb906482 ]
2025-09-16 15:07:10 +08:00
Po Yen Chen
9c6cca5bc5
[CK_TILE] FMHA FAv3 scheduling fine-tuning for performance ( #2833 )
...
* Re-mapping thread block indices for causal=True kernels
* Use more intuitive remap_opt value
* Fall back to original remapping if seqlen_q >= 64K
* Use GenericAttentionMask to reduce mask computation
* Avoid unnecessary boundary check for IsMasking=false case
* Fix wrong kernel entry specifier
* Add s_nop to prevent delay wave0-3
* Refine scheduling
* Remove unnecessary sched_group_barrier()
* Move sched_group_barrier() call to scheduler
* Replace inline asm s_setprio with intrinsics
* Rephrase comments
* Expand some o_acc rescaling insts to avoid SIMD idle
* Fix block idx special mapping logic
* Tune block index mapping for causal=False cases
* Tune block index mapping for causal=True cases
* Fix wrong vmcnt()
* Remove parameter name
* Use boolean option to turn causal mask on/off
* Update benchmark_fwd_v3.sh option usages
* Add option if compiler supports it
[ROCm/composable_kernel commit: 7fbc9d6c97 ]
2025-09-16 11:32:38 +08:00
Yi DING
22490acf0b
[CK_TILE] Fix kname & typo in FMHA BWD ( #2809 )
...
[ROCm/composable_kernel commit: 91178b4011 ]
2025-09-09 15:08:00 -07:00
Yi DING
cb02189b6e
Reapply "[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )" ( #2757 ) ( #2761 )
...
This reverts commit 3db57fc348 .
[ROCm/composable_kernel commit: 5ff205ca79 ]
2025-09-08 09:21:14 -07:00
Thomas Ning
8e30f972f6
[FIX] fix on fmha_bwd ( #2784 )
...
* fix on fmha_bwd
* Add 'const' to the Default2DEpilogue call operator
* Fix more calls to Default2DEpilogue
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com >
Co-authored-by: Yi DING <yi.ding@amd.com >
[ROCm/composable_kernel commit: 42a43d1523 ]
2025-09-08 14:31:27 +08:00
Po Yen Chen
793645c57c
[CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage ( #2765 )
...
* Fix Default2DEpilogue usage
* Fix Default2DEpilogue usage for batch_prefill
[ROCm/composable_kernel commit: 9f35cde374 ]
2025-09-02 09:51:56 -07:00
Haocong WANG
2133e2829a
Fix naming issue ( #2762 )
...
[ROCm/composable_kernel commit: 33418b201f ]
2025-09-02 11:18:53 +08:00
Po Yen Chen
7fcd094b8c
[CK_TILE] Add FAv3 fwd pipeline ( #2731 )
...
* Add FAv3 fwd pipeline
* Unpack v_pk_mul to hide v_mov
* Avoid compiler moving l compute across phase
* Sync sched_group_barrier() setting for masking cases
[ROCm/composable_kernel commit: d876e87fe4 ]
2025-09-01 09:16:45 +08:00
Mateusz Ozga
e4010d5ea1
[CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D ( #2752 )
...
* Init commit
* Quick fix, CI fails
* Remove CDElementWise
* Add CDEELementWise
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: 0758883fa4 ]
2025-08-28 12:45:50 -07:00
asleepzzz
3db57fc348
Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )" ( #2757 )
...
This reverts commit f2e6edde3b .
[ROCm/composable_kernel commit: 038ea82315 ]
2025-08-28 22:50:42 +08:00
Yi DING
f2e6edde3b
[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )
...
* 16x192
* Use buffer_load_lds for lse/d
* Dispatch & cleanup
* Avoid zeroing dq & fix
* fix
[ROCm/composable_kernel commit: ead4447b20 ]
2025-08-28 18:54:18 +08:00
Linjun-AMD
56395b127c
use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline ( #2711 )
...
* add k_lds padding and iglp to improve dim256 fmha fwd
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp
* clang format
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* use same naming style
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
[ROCm/composable_kernel commit: bf7b458e6e ]
2025-08-28 11:39:39 +08:00
Yi DING
1f35e4c5b7
[CK_TILE] FMHA avoid unnecessary vmcnt0 ( #2715 )
...
* FMHA avoid unnecessary vmcnt0
Squashed commit of the following:
commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 22 03:15:51 2025 +0000
merge develop and solve conflicts
commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c
Merge: 2dac61a4f 5d56dde0e
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 22 03:15:21 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue
commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140
Author: Ding, Yi <yi.ding@amd.com >
Date: Tue Aug 19 02:17:43 2025 +0000
update bwd
commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682
Author: Kevin Choi <kevin.choi@amd.com >
Date: Mon Aug 18 19:36:38 2025 +0000
add restrict to applicable functions
commit 45534dee5bcbe532da46fc5cd6601cde10d84387
Author: Ding, Yi <yi.ding@amd.com >
Date: Mon Aug 18 02:07:03 2025 +0000
bwd filter
commit 7abd7b372b82cba94a457238b6b4a81d093e7280
Author: Kevin Choi <kevin.choi@amd.com >
Date: Sat Aug 16 08:15:23 2025 +0000
remove noinline attr as it causes a lot more s_waitcnt's
commit 89c29746a09255c1d26038171157e91d1b68d14a
Author: Kevin Choi <kevin.choi@amd.com >
Date: Thu Aug 14 12:11:17 2025 +0000
remove innerloop, move restrict parameters to mainloop and add noinline attribute.
commit 6f61b3a5c80011411aa3aebf7983602f7c117566
Author: Kevin Choi <kevin.choi@amd.com >
Date: Thu Aug 14 07:06:51 2025 +0000
Create inner lambda with restrict parameters, add restrict to some parameters
commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a
Author: aska-0096 <haocwang@amd.com >
Date: Thu Aug 14 03:43:54 2025 +0000
save for debug
commit 5f2c3cfa86c6951208a1cc227fa556704a885a88
Merge: 25f067b4f 165a2723c
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 13 02:15:22 2025 +0000
Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 25f067b4f09d6909a05e252c7621124046dfda57
Merge: 447c1c5d6 bd3b4afb9
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 13 02:14:26 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 165a2723c557420b48891cc1ce3434e3675aef5d
Merge: 447c1c5d6 be39c00a8
Author: asleepzzz <hanwen.chang@amd.com >
Date: Wed Aug 13 00:34:11 2025 +0800
Merge branch 'develop' into wip-async-tr-fa
commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 14:25:50 2025 +0000
refactor blockgemm change, isolate to v2;
commit 8f67083511ff77d31c880f4427d3bdf53a179568
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 09:26:13 2025 +0000
clang format
commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965
Merge: c84662c34 19ef22e56
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 09:04:41 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit c84662c345755ec5f3d524fdde4aa951c8f86298
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 08:46:06 2025 +0000
Fix the bug
commit e0647ffa5646f8132529b152af02750c4010013d
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 04:02:41 2025 +0000
fix conflict. disable all v-col instance for fmha fwd
commit 781f98236c376f57591a6d481cc2ee04b36a148b
Merge: 241f3d7dc 8cb8da53c
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 03:52:34 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 01:53:31 2025 +0000
clang format
commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f
Merge: 1a629c098 eda4a5e80
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 01:52:52 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 1a629c09876cc05f0750db7eade1d527dc32a1d3
Merge: f65874e5b 92c0435e2
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 15:59:40 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit f65874e5b07579d5b734b4c68877679a3ee04dac
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 15:37:37 2025 +0000
change the warp setting for hdim32 fmha fwd
commit 7c5f5e65e97486c074ef9a138900ed9aafea547e
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 14:21:09 2025 +0000
tempsave, update the blocksync functions
commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915
Author: aska-0096 <haocwang@amd.com >
Date: Sun Aug 10 06:00:51 2025 +0000
fix bug in pki4
commit 073db2e18af21f1ed1fb3d1f1c15830838df986f
Author: aska-0096 <haocwang@amd.com >
Date: Sat Aug 9 03:25:12 2025 +0000
fix bugs in gemm
commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 18:35:53 2025 +0000
fix bug on non-gfx950
commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 17:53:19 2025 +0000
fix bug
commit 30de97f473685e0bd5b82f15eee2493d9a05cffd
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 15:42:15 2025 +0000
fix bugs
commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:31:01 2025 +0000
fix clangformat with 18.1.3
commit e4cb185c41586d018771a5413efd909d8d53a8c5
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:07:40 2025 +0000
remove non-necessary change
commit 498f0d44cfba17287cce8d10855cce5c5de263db
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:04:02 2025 +0000
bug fix, clang format;
commit 3cb648cbc4883e6889340d85f48d803a21b9c805
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 08:08:03 2025 +0000
Remove unnecessary changes
commit 9e7ff3b611b7933b65973907a0cae312a15d31c6
Merge: a3c1bfe6d f4247d199
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 07:50:12 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 06:19:31 2025 +0000
remove unnecessary files; rename some files
commit 6c257fa27729c005d539b5b71deeba3703031089
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 05:46:18 2025 +0000
merge fa_decode pipeline into fmha_fwd api
commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 6 05:58:43 2025 +0000
add __restrict__ to tr load
commit bbad2b979b701533b74f43452ffe0f775e019139
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 5 07:23:51 2025 +0000
Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
commit d7fabd5f765e2a573ddbaf0857ce6f691407e562
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 10:27:42 2025 +0000
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 10:02:17 2025 +0000
add vmcnt guard before load ktile
commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 06:49:01 2025 +0000
Load Q through lds, implement xor;
commit 62bb9f05177dfb8280d6c2be67a88492d6be4838
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 1 10:44:54 2025 +0000
small refactor
commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 31 10:25:37 2025 +0000
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 31 05:13:27 2025 +0000
enable larger tile size; upgrade xor pattern
commit a468e59a01d6dd85c105ca30ac249491256c5915
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 30 12:25:33 2025 +0000
remove all lds bankconflict with xor layouts
commit 39ff55cdc377311112100fb24bc013adfd8960c0
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 30 03:51:06 2025 +0000
enable prefill overload operator().
commit a7b152a788e8035c93f8e4cbf317863182665d8f
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 25 07:10:01 2025 +0000
fix the lds alignment caused performance regression
commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 23 09:05:57 2025 +0000
remove unnecessary features
commit 9758750801c7fd5a80f654eb982f43b87d674fa3
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 22 08:04:05 2025 +0000
tempsave. asynccopy+trload sanity checked
commit 1c4c04d725047357224ebf8a2b94d9010a5651a6
Author: aska-0096 <haocwang@amd.com >
Date: Mon Jul 21 05:55:55 2025 +0000
tempsave, trload+asyncload done
commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 10:04:34 2025 +0000
compile pass
commit d41b5eace939909084d32281710fb81142ad5fec
Merge: 3f86a81ee 8c3766f0d
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 05:17:27 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 3f86a81eee75256a78df02032d50814aaa42b038
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 05:16:39 2025 +0000
tempsave
commit 7d43f7446a9a20773f70e08462393f6c9afb7280
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 10:06:09 2025 +0000
temp save
commit 727629cd9115f1be9c1800bb65a8ea84ff06c250
Merge: aa5da19c9 94bceebc9
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 07:24:32 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
commit 94bceebc96ef4885e0ac861b7793e2e2897481bd
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 03:10:46 2025 +0000
move test_copy into test
commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:41:31 2025 +0000
remove unnecessary output
commit b1dbcacb1832560c6cc967a079dffce558228f0b
Merge: 5b0d311e6 0eaf3325a
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:26:13 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit 5b0d311e649257557a7014c28fcfac0c327b77b5
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:26:10 2025 +0000
add input validation and bug fix
commit 0eaf3325a8e019402ff12a2402f446f8471f584f
Merge: a66e1d29a f77d70498
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Date: Wed Jul 16 11:23:57 2025 -0700
Merge branch 'develop' into test_copy_fix
commit a66e1d29a8cccc17cc8958d970ec7b1281ec8291
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 08:55:50 2025 +0000
fix vmcnt shift
commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 08:37:07 2025 +0000
Improve s_waitcnt_imm calculation
commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 05:39:50 2025 +0000
fix the s_waitcnt_imm calculation
commit 1c0870089a0e7c78ed71a278bf52d98fc780e482
Merge: d6ee05e36 92ada43ba
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 03:57:57 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 03:54:33 2025 +0000
Add block_sync_lds_direct_load utility
commit c037a72040217471f52ee76bed9c07bf5b22aef4
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 15 09:39:03 2025 +0000
fix async copytest bug
commit aa5da19c94022449b027e7a57668f2e219f0f171
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 10 04:29:33 2025 +0000
temp save, change all instance to 1wave
commit ddd172feb9eb2cb783420a8db6f44d51b350c370
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 8 08:37:20 2025 +0000
tempsave, fmha_decode
commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33
Author: aska-0096 <haocwang@amd.com >
Date: Sat Jun 21 15:02:57 2025 +0000
temp save, waiting for debug
commit 71dd31f15bca01995c8cb0be9e903103f4657181
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jun 19 05:11:52 2025 +0000
save an example for __bf16 type
commit cdf33e079fa7d7d5b03b06550df2356b02041d7b
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jun 18 07:27:24 2025 +0000
fix bwd code
commit d630998dc6751f44097b1e9a239bb5063a793736
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jun 18 06:37:16 2025 +0000
Fix for fwd/bwd kernel build filter
commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com >
Date: Tue Aug 19 08:13:18 2025 +0000
Add restrict to avoid unnecessary vmcnt
---------
Co-authored-by: aska-0096 <haocwang@amd.com >
* Add comments for C-style cast
* Better comments
---------
Co-authored-by: aska-0096 <haocwang@amd.com >
[ROCm/composable_kernel commit: de61e55493 ]
2025-08-25 20:55:12 +08:00
John Shumway
1c8519a03b
Remove unsupported use of c++20 concept. ( #2719 )
...
Downstream libraries haven't migrated to C++20 yet, so replace a use of a C++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.
[ROCm/composable_kernel commit: c71d7ddd74 ]
2025-08-24 21:29:23 -07:00
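The SFINAE replacement described in #2719 above could look roughly like the C++17 sketch below. It is not the code from the commit: `has_true_flag`, `kFlag`, and `select_path` are hypothetical names chosen for illustration, and the sketch assumes the member is a `static constexpr` value convertible to `bool`.

```cpp
#include <cassert> // only needed by callers that add runtime checks
#include <type_traits>

// Primary template: chosen when T::kFlag does not exist.
template <typename T, typename = void>
struct has_true_flag : std::false_type
{
};

// Specialization: viable only if T::kFlag is well-formed (existence check);
// its value then decides the result (truthiness check).
template <typename T>
struct has_true_flag<T, std::void_t<decltype(T::kFlag)>>
    : std::bool_constant<static_cast<bool>(T::kFlag)>
{
};

template <typename T>
inline constexpr bool has_true_flag_v = has_true_flag<T>::value;

// Hypothetical types covering the three cases.
struct FlagOn  { static constexpr bool kFlag = true; };
struct FlagOff { static constexpr bool kFlag = false; };
struct NoFlag  {};

static_assert(has_true_flag_v<FlagOn>);
static_assert(!has_true_flag_v<FlagOff>);
static_assert(!has_true_flag_v<NoFlag>);

// Overload selection with enable_if, standing in for a `requires` clause.
template <typename T, std::enable_if_t<has_true_flag_v<T>, int> = 0>
constexpr int select_path()
{
    return 1; // flag present and true
}

template <typename T, std::enable_if_t<!has_true_flag_v<T>, int> = 0>
constexpr int select_path()
{
    return 0; // flag missing or false
}
```

The two-step structure matters: a missing member is a substitution failure that quietly falls back to the primary template, while a present-but-false member selects the specialization and yields `false` through its base class. A member that exists but is not a constant expression would be a hard error in this sketch.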
Po Yen Chen
b87e256c42
[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() ( #2549 )
...
* Support async load dwordx4
* Enlarge load size on gfx950
[ROCm/composable_kernel commit: 4a7ecce096 ]
2025-08-22 10:13:47 +08:00
Yi DING
f885dd8b33
[CK_TILE] FMHA BWD Fix Compilation with Bias ( #2682 )
...
* [CK_TILE] FMHA BWD Fix Compilation with Bias
* Fix appendkv kApplyRoPE
[ROCm/composable_kernel commit: 4cfa2c7158 ]
2025-08-22 10:01:10 +08:00
linqunAMD
615ca9842d
Support Wave32 in CK_TILE - Part 1 ( #2594 )
...
* Support wave32/wave64 in CK_TILE - Part 1
* remove blocksize in kernel launch
* fix build error
* fix clang format
* fix clang format 2
* fix clang format 3
* fix fmha build error
* fix fmha build 2
* fix fmha build 3
* fix build error 4
* address review comment
* update change log
* replace KernelBlockSize with kBlockSize
* fix CI fail
* fix clang format
* address review comment and rebase code.
* fix universal test fail
---------
Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com >
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: 9fcc1ee9fd ]
2025-08-18 10:08:31 -07:00
Tianyuan Wu
abb90422b4
[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 ( #2466 )
...
* WMMA GEMM F16 Implementation
Signed-off-by: root <tianyuwu@amd.com >
* Self-review
Signed-off-by: root <tianyuwu@amd.com >
* ASIC check minor tweak
Signed-off-by: root <tianyuwu@amd.com >
* add missing include file
* Set GPU_TARGETS to gfx11/12 generic
Signed-off-by: root <tianyuwu@amd.com >
* INT8 GFX12
Signed-off-by: root <tianyuwu@amd.com >
* add int8x16 branch
* Fix CI script
Signed-off-by: root <tianyuwu@amd.com >
* Fix typo
Signed-off-by: root <tianyuwu@amd.com >
* Add CK_Tile WMMA example
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Fix CI
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* fix clang format
* Set M/N_Warp Back to Constant
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Use GemmConfigComputeV3 by default
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Remove CK_Tile wmma gemm examples from the CI list
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add atomic add fallback method for gfx11
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Omit copyright year
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Support non-square cases
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add get_device_ip()
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Add atomic add fallback method for gfx11"
This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746.
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"
This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5.
* Revise method name and typos
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Try fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Try fix CI"
This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902.
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo caused by merge
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Fix typo caused by merging
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
---------
Signed-off-by: root <tianyuwu@amd.com >
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 68134b60e4 ]
2025-08-15 16:22:27 -07:00
SamiAario-AMD
8a32077ccd
Cleanups ( #2631 )
...
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
[ROCm/composable_kernel commit: 28a97865f5 ]
2025-08-13 10:12:08 +02:00
Haocong WANG
e3e8de3477
Re-enable optimization for gfx950 fmha fwd ( #2671 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
* refactor blockgemm change, isolate to v2;
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 05a6e92705 ]
2025-08-13 14:57:43 +08:00
asleepzzz
4730050182
Revert "Optimize fmha fwd decode & prefill for gfx950 ( #2641 )" ( #2670 )
...
This reverts commit 747d127983 .
[ROCm/composable_kernel commit: 5b39de4bb6 ]
2025-08-12 20:27:10 +08:00
Haocong WANG
747d127983
Optimize fmha fwd decode & prefill for gfx950 ( #2641 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
[ROCm/composable_kernel commit: b7322a521a ]
2025-08-12 19:43:14 +08:00
Yi DING
19ef22e567
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
[ROCm/composable_kernel commit: 8e1eb0c1ee ]
2025-08-12 17:02:52 +08:00
Yi DING
8cb8da53c9
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
[ROCm/composable_kernel commit: 4fde1646e5 ]
2025-08-12 11:11:55 +08:00
Yi DING
6fdfcdb180
[CK_TILE] FMHA BWD Remove Unnecessary Padding ( #2550 )
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
[ROCm/composable_kernel commit: b0a97498b0 ]
2025-08-07 21:24:43 +08:00
Illia Silin
bbe875838f
fix the clang-format ( #2578 )
...
[ROCm/composable_kernel commit: 49723e94bb ]
2025-07-28 20:49:55 -07:00
Yi DING
38bb8bee29
[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 ( #2130 )
...
* Fix shuffle_tile
* Add fmha bwd d160
* CHANGELOG
* Use static_cast
* Update
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 1926cd0cb8 ]
2025-07-29 09:31:14 +08:00
Illia Silin
24f228df3b
upgrade from clang-format-12 to clang-format-18 ( #2568 )
...
* upgrade to clang-format-18
* update to clang-format-18 in pre-commit-config
[ROCm/composable_kernel commit: 504b101da3 ]
2025-07-28 11:34:07 -07:00
shay-li77
9960927572
fix mha bwd dbias random mismatch ( #2570 )
...
* fix mha bwd dbias random mismatch
* formatting code
[ROCm/composable_kernel commit: 8ae528a1b4 ]
2025-07-28 14:39:31 +08:00
liang
a6d55da47f
reorder grid dim schedule ( #2533 )
...
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: d2459878cf ]
2025-07-26 02:46:55 +08:00
Po Yen Chen
e500919c52
Revert "Eliminate warning caused by failed to meet occupancy requirement ( #2389 )" ( #2514 )
...
This reverts commit e35e3dee86 .
[ROCm/composable_kernel commit: 722c22fb15 ]
2025-07-17 10:09:01 +08:00
Qianfeng
fb42be79dc
Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) ( #2487 )
...
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* i_nhead_ conversion type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com >
[ROCm/composable_kernel commit: 45904b8fd7 ]
2025-07-11 18:14:47 +08:00
shay-li77
0a1eb8381d
support y-direction step length greater than 1 for SimplifiedGenericAttentionMask ( #2338 )
...
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
[ROCm/composable_kernel commit: d814fefe18 ]
2025-07-09 23:18:55 +08:00
Haocong WANG
4b6049f553
[CK TILE] Fix FA build filter ( #2369 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
[ROCm/composable_kernel commit: 5557eadce6 ]
2025-07-08 10:42:07 +08:00
Po Yen Chen
e35e3dee86
Eliminate warning caused by failed to meet occupancy requirement ( #2389 )
...
Co-authored-by: felix <felix.li@amd.com >
[ROCm/composable_kernel commit: b2dea90116 ]
2025-07-08 09:17:25 +08:00
ltqin
273c031eca
ck tile pagedkv prefill ( #2405 )
...
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* compiling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernel with pagedkv
* change names of file that are generated
* change host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: 9f4c5d7372 ]
2025-07-07 16:16:54 +08:00
Po Yen Chen
b86c92c84e
[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface ( #2403 )
...
* Rename batch_prerfill interface
* Add min_seqlen_q parameter in MakeKargs()
[ROCm/composable_kernel commit: 50fad03524 ]
2025-06-25 15:19:21 +08:00