Yi DING
a29eac5da0
[CK_TILE] FMHA avoid unnecessary vmcnt0 ( #2715 )
...
* FMHA avoid unnecessary vmcnt0
Squashed commit of the following:
commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 22 03:15:51 2025 +0000
merge develop and solve conflicts
commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c
Merge: 2dac61a4f 13a6816fb
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 22 03:15:21 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue
commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140
Author: Ding, Yi <yi.ding@amd.com >
Date: Tue Aug 19 02:17:43 2025 +0000
update bwd
commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682
Author: Kevin Choi <kevin.choi@amd.com >
Date: Mon Aug 18 19:36:38 2025 +0000
add restrict to applicable functions
commit 45534dee5bcbe532da46fc5cd6601cde10d84387
Author: Ding, Yi <yi.ding@amd.com >
Date: Mon Aug 18 02:07:03 2025 +0000
bwd filter
commit 7abd7b372b82cba94a457238b6b4a81d093e7280
Author: Kevin Choi <kevin.choi@amd.com >
Date: Sat Aug 16 08:15:23 2025 +0000
remove noinline attr as it causes a lot more s_waitcnt's
commit 89c29746a09255c1d26038171157e91d1b68d14a
Author: Kevin Choi <kevin.choi@amd.com >
Date: Thu Aug 14 12:11:17 2025 +0000
remove innerloop, move restrict parameters to mainloop and add noinline attribute.
commit 6f61b3a5c80011411aa3aebf7983602f7c117566
Author: Kevin Choi <kevin.choi@amd.com >
Date: Thu Aug 14 07:06:51 2025 +0000
Create inner lambda with restrict parameters, add restrict to some parameters
commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a
Author: aska-0096 <haocwang@amd.com >
Date: Thu Aug 14 03:43:54 2025 +0000
save for debug
commit 5f2c3cfa86c6951208a1cc227fa556704a885a88
Merge: 25f067b4f 165a2723c
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 13 02:15:22 2025 +0000
Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 25f067b4f09d6909a05e252c7621124046dfda57
Merge: 447c1c5d6 2ad2f97b7
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 13 02:14:26 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 165a2723c557420b48891cc1ce3434e3675aef5d
Merge: 447c1c5d6 4491739ab
Author: asleepzzz <hanwen.chang@amd.com >
Date: Wed Aug 13 00:34:11 2025 +0800
Merge branch 'develop' into wip-async-tr-fa
commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 14:25:50 2025 +0000
refactor blockgemm change, isolate to v2;
commit 8f67083511ff77d31c880f4427d3bdf53a179568
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 09:26:13 2025 +0000
clang format
commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965
Merge: c84662c34 245071bcf
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 09:04:41 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit c84662c345755ec5f3d524fdde4aa951c8f86298
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 08:46:06 2025 +0000
Fix the bug
commit e0647ffa5646f8132529b152af02750c4010013d
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 04:02:41 2025 +0000
fix conflict. disable all v-col instance for fmha fwd
commit 781f98236c376f57591a6d481cc2ee04b36a148b
Merge: 241f3d7dc 6e03d9607
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 03:52:34 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 01:53:31 2025 +0000
clang format
commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f
Merge: 1a629c098 3639befe9
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 01:52:52 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 1a629c09876cc05f0750db7eade1d527dc32a1d3
Merge: f65874e5b b34a029cd
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 15:59:40 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit f65874e5b07579d5b734b4c68877679a3ee04dac
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 15:37:37 2025 +0000
change the warp setting for hdim32 fmha fwd
commit 7c5f5e65e97486c074ef9a138900ed9aafea547e
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 14:21:09 2025 +0000
tempsave, update the blocksync functions
commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915
Author: aska-0096 <haocwang@amd.com >
Date: Sun Aug 10 06:00:51 2025 +0000
fix bug in pki4
commit 073db2e18af21f1ed1fb3d1f1c15830838df986f
Author: aska-0096 <haocwang@amd.com >
Date: Sat Aug 9 03:25:12 2025 +0000
fix bugs in gemm
commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 18:35:53 2025 +0000
fix bug on non-gfx950
commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 17:53:19 2025 +0000
fix bug
commit 30de97f473685e0bd5b82f15eee2493d9a05cffd
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 15:42:15 2025 +0000
fix bugs
commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:31:01 2025 +0000
fix clangformat with 18.1.3
commit e4cb185c41586d018771a5413efd909d8d53a8c5
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:07:40 2025 +0000
remove non-necessary change
commit 498f0d44cfba17287cce8d10855cce5c5de263db
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:04:02 2025 +0000
bug fix, clang format;
commit 3cb648cbc4883e6889340d85f48d803a21b9c805
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 08:08:03 2025 +0000
Remove unnecessary changes
commit 9e7ff3b611b7933b65973907a0cae312a15d31c6
Merge: a3c1bfe6d 7f14bd1df
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 07:50:12 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 06:19:31 2025 +0000
remove unnecessary files; rename some files
commit 6c257fa27729c005d539b5b71deeba3703031089
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 05:46:18 2025 +0000
merge fa_decode pipeline into fmha_fwd api
commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 6 05:58:43 2025 +0000
add __restrict__ to tr load
commit bbad2b979b701533b74f43452ffe0f775e019139
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 5 07:23:51 2025 +0000
Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
commit d7fabd5f765e2a573ddbaf0857ce6f691407e562
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 10:27:42 2025 +0000
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 10:02:17 2025 +0000
add vmcnt guard before load ktile
commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 06:49:01 2025 +0000
Load Q through lds, implement xor;
commit 62bb9f05177dfb8280d6c2be67a88492d6be4838
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 1 10:44:54 2025 +0000
small refactor
commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 31 10:25:37 2025 +0000
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 31 05:13:27 2025 +0000
enable larger tile size; upgrade xor pattern
commit a468e59a01d6dd85c105ca30ac249491256c5915
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 30 12:25:33 2025 +0000
remove all lds bankconflict with xor layouts
commit 39ff55cdc377311112100fb24bc013adfd8960c0
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 30 03:51:06 2025 +0000
enable prefill overload operator().
commit a7b152a788e8035c93f8e4cbf317863182665d8f
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 25 07:10:01 2025 +0000
fix the lds alignment caused performance regression
commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 23 09:05:57 2025 +0000
remove unnecessary features
commit 9758750801c7fd5a80f654eb982f43b87d674fa3
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 22 08:04:05 2025 +0000
tempsave. asynccopy+trload sanity checked
commit 1c4c04d725047357224ebf8a2b94d9010a5651a6
Author: aska-0096 <haocwang@amd.com >
Date: Mon Jul 21 05:55:55 2025 +0000
tempsave, trload+asyncload done
commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 10:04:34 2025 +0000
compile pass
commit d41b5eace939909084d32281710fb81142ad5fec
Merge: 3f86a81ee 33204e15f
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 05:17:27 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 3f86a81eee75256a78df02032d50814aaa42b038
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 05:16:39 2025 +0000
tempsave
commit 7d43f7446a9a20773f70e08462393f6c9afb7280
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 10:06:09 2025 +0000
temp save
commit 727629cd9115f1be9c1800bb65a8ea84ff06c250
Merge: aa5da19c9 94bceebc9
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 07:24:32 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
commit 94bceebc96ef4885e0ac861b7793e2e2897481bd
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 03:10:46 2025 +0000
move test_copy into test
commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:41:31 2025 +0000
remove unnecessary output
commit b1dbcacb1832560c6cc967a079dffce558228f0b
Merge: 5b0d311e6 0eaf3325a
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:26:13 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit 5b0d311e649257557a7014c28fcfac0c327b77b5
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:26:10 2025 +0000
add input validation and bug fix
commit 0eaf3325a8e019402ff12a2402f446f8471f584f
Merge: a66e1d29a 08c5df68a
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Date: Wed Jul 16 11:23:57 2025 -0700
Merge branch 'develop' into test_copy_fix
commit a66e1d29a8cccc17cc8958d970ec7b1281ec8291
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 08:55:50 2025 +0000
fix vmcnt shift
commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 08:37:07 2025 +0000
Improve s_waitcnt_imm calculation
commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 05:39:50 2025 +0000
fix the s_waitcnt_imm calculation
commit 1c0870089a0e7c78ed71a278bf52d98fc780e482
Merge: d6ee05e36 d9de58c66
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 03:57:57 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 03:54:33 2025 +0000
Add block_sync_lds_direct_load utility
commit c037a72040217471f52ee76bed9c07bf5b22aef4
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 15 09:39:03 2025 +0000
fix async copytest bug
commit aa5da19c94022449b027e7a57668f2e219f0f171
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 10 04:29:33 2025 +0000
temp save, change all instance to 1wave
commit ddd172feb9eb2cb783420a8db6f44d51b350c370
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 8 08:37:20 2025 +0000
tempsave, fmha_decode
commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33
Author: aska-0096 <haocwang@amd.com >
Date: Sat Jun 21 15:02:57 2025 +0000
temp save, waiting for debug
commit 71dd31f15bca01995c8cb0be9e903103f4657181
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jun 19 05:11:52 2025 +0000
save an example for __bf16 type
commit cdf33e079fa7d7d5b03b06550df2356b02041d7b
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jun 18 07:27:24 2025 +0000
fix bwd code
commit d630998dc6751f44097b1e9a239bb5063a793736
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jun 18 06:37:16 2025 +0000
Fix for fwd/bwd kernel build filter
commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com >
Date: Tue Aug 19 08:13:18 2025 +0000
Add restrict to avoid unnecessary vmcnt
---------
Co-authored-by: aska-0096 <haocwang@amd.com >
* Add comments for c-style cast
* Better comments
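The commit titles above suggest that annotating pointer parameters with `__restrict__` lets the compiler drop conservative memory waits (`s_waitcnt vmcnt(0)`) that it would otherwise emit when it cannot prove two pointers are independent. The host-side sketch below only illustrates the annotation itself; `scale_copy` is a hypothetical function that does not come from the commit, and the effect on generated GPU ISA is not demonstrated here.

```cpp
#include <cassert>
#include <cstddef>

// `__restrict__` is a compiler extension (GCC, Clang, hipcc), not standard C++.
// It is a promise from the caller that `dst` and `src` never alias, so a store
// through `dst` cannot change anything later read through `src`.
// NOTE: hypothetical helper for illustration, not code from the commit.
inline void scale_copy(float* __restrict__ dst,
                       const float* __restrict__ src,
                       std::size_t n,
                       float scale)
{
    for(std::size_t i = 0; i < n; ++i)
    {
        dst[i] = src[i] * scale; // loads and stores may be reordered freely
    }
}
```

Calling `scale_copy` with overlapping buffers breaks the promise and is undefined behavior, so the annotation belongs only on parameters whose independence is guaranteed by the caller.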
---------
Co-authored-by: aska-0096 <haocwang@amd.com >
[ROCm/composable_kernel commit: de61e55493 ]
2025-08-25 20:55:12 +08:00
John Shumway
1ec340cc3b
Remove unsupported use of c++20 concept. ( #2719 )
...
Downstream libraries haven't migrated to C++20 yet, so replace a use of a C++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.
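A minimal C++17 sketch of this kind of check, assuming a hypothetical static member named `kEnabled` (the actual member and trait names in the commit are not shown here). The trait is false when the member is missing and otherwise takes the member's value:

```cpp
#include <cassert>
#include <type_traits>

// Primary template: chosen when `T::kEnabled` is not a valid expression.
template <typename T, typename = void>
struct has_enabled_flag : std::false_type
{
};

// Partial specialization: viable only if `T::kEnabled` exists (existence check).
// Its value is the member converted to bool (truthiness check).
template <typename T>
struct has_enabled_flag<T, std::void_t<decltype(T::kEnabled)>>
    : std::bool_constant<static_cast<bool>(T::kEnabled)>
{
};

// Hypothetical types used only to exercise the trait.
struct WithTrue  { static constexpr bool kEnabled = true; };
struct WithFalse { static constexpr bool kEnabled = false; };
struct Without   {};

// SFINAE-constrained overloads standing in for a `requires` clause.
template <typename T, std::enable_if_t<has_enabled_flag<T>::value, int> = 0>
constexpr int pick_path() { return 1; } // member exists and is true

template <typename T, std::enable_if_t<!has_enabled_flag<T>::value, int> = 0>
constexpr int pick_path() { return 0; } // member missing or false
```

Because the two `enable_if_t` conditions are mutually exclusive, exactly one overload participates in overload resolution for any `T`, which mirrors what a pair of constrained templates would do in C++20.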
[ROCm/composable_kernel commit: c71d7ddd74 ]
2025-08-24 21:29:23 -07:00
Po Yen Chen
a2b87dd39d
[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() ( #2549 )
...
* Support async load dwordx4
* Enlarge load size on gfx950
[ROCm/composable_kernel commit: 4a7ecce096 ]
2025-08-22 10:13:47 +08:00
Yi DING
d732e1e379
[CK_TILE] FMHA BWD Fix Compilation with Bias ( #2682 )
...
* [CK_TILE] FMHA BWD Fix Compilation with Bias
* Fix appendkv kApplyRoPE
[ROCm/composable_kernel commit: 4cfa2c7158 ]
2025-08-22 10:01:10 +08:00
linqunAMD
807f7510b5
Support Wave32 in CK_TILE - Part 1 ( #2594 )
...
* Support wave32/wave64 in CK_TILE - Part 1
* remove blocksize in kernel launch
* fix build error
* fix clang format
* fix clang format 2
* fix clang format 3
* fix fmha build error
* fix fmha build 2
* fix fmha build 3
* fix build error 4
* address review comment
* update change log
* replace KernelBlockSize with kBlockSize
* fix CI fail
* fix clang format
* address review comment and rebase code.
* fix universal test fail
---------
Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com >
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: 9fcc1ee9fd ]
2025-08-18 10:08:31 -07:00
Tianyuan Wu
ec7ee5b7b7
[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 ( #2466 )
...
* WMMA GEMM F16 Implementation
Signed-off-by: root <tianyuwu@amd.com >
* Self-review
Signed-off-by: root <tianyuwu@amd.com >
* ASIC check minor tweak
Signed-off-by: root <tianyuwu@amd.com >
* add missing include file
* Set GPU_TARGETS to gfx11/12 generic
Signed-off-by: root <tianyuwu@amd.com >
* INT8 GFX12
Signed-off-by: root <tianyuwu@amd.com >
* add int8x16 branch
* Fix CI script
Signed-off-by: root <tianyuwu@amd.com >
* Fix typo
Signed-off-by: root <tianyuwu@amd.com >
* Add CK_Tile WMMA example
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Fix CI
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* fix clang format
* Set M/N_Warp Back to Constant
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Use GemmConfigComputeV3 by default
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Remove CK_Tile wmma gemm examples from the CI list
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add atomic add fallback method for gfx11
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Omit copyright year
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Support non-square cases
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add get_device_ip()
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Add atomic add fallback method for gfx11"
This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746.
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"
This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5.
* Revise method name and typos
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Try fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Try fix CI"
This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902.
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo caused by merge
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Fix typo caused by merging
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
---------
Signed-off-by: root <tianyuwu@amd.com >
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 68134b60e4 ]
2025-08-15 16:22:27 -07:00
SamiAario-AMD
a2e850acbc
Cleanups ( #2631 )
...
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
[ROCm/composable_kernel commit: 28a97865f5 ]
2025-08-13 10:12:08 +02:00
Haocong WANG
cfd8e1b303
Re-enable optimization for gfx950 fmha fwd ( #2671 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
* refactor blockgemm change, isolate to v2;
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 05a6e92705 ]
2025-08-13 14:57:43 +08:00
asleepzzz
9161cb590d
Revert "Optimize fmha fwd decode & prefill for gfx950 ( #2641 )" ( #2670 )
...
This reverts commit 327bf408dd05b4e4bfb7b72f63f8710f35efa9a4.
[ROCm/composable_kernel commit: 5b39de4bb6 ]
2025-08-12 20:27:10 +08:00
Haocong WANG
8a20b06f54
Optimize fmha fwd decode & prefill for gfx950 ( #2641 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
[ROCm/composable_kernel commit: b7322a521a ]
2025-08-12 19:43:14 +08:00
Yi DING
0afd7af89c
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
[ROCm/composable_kernel commit: 8e1eb0c1ee ]
2025-08-12 17:02:52 +08:00
Yi DING
ef32abc4cc
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
[ROCm/composable_kernel commit: 4fde1646e5 ]
2025-08-12 11:11:55 +08:00
Yi DING
84a17e2f94
[CK_TILE] FMHA BWD Remove Unnecessary Padding ( #2550 )
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
[ROCm/composable_kernel commit: b0a97498b0 ]
2025-08-07 21:24:43 +08:00
Illia Silin
6704e5303f
fix the clang-format ( #2578 )
...
[ROCm/composable_kernel commit: 49723e94bb ]
2025-07-28 20:49:55 -07:00
Yi DING
a592107cb9
[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 ( #2130 )
...
* Fix shuffle_tile
* Add fmha bwd d160
* CHANGELOG
* Use static_cast
* Update
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 1926cd0cb8 ]
2025-07-29 09:31:14 +08:00
Illia Silin
3345f5f417
upgrade from clang-format-12 to clang-format-18 ( #2568 )
...
* upgrade to clang-format-18
* update to clang-format-18 in pre-commit-config
[ROCm/composable_kernel commit: 504b101da3 ]
2025-07-28 11:34:07 -07:00
shay-li77
d143f1466f
fix mha bwd dbias random mismatch ( #2570 )
...
* fix mha bwd dbias random mismatch
* formatting code
[ROCm/composable_kernel commit: 8ae528a1b4 ]
2025-07-28 14:39:31 +08:00
liang
d055927b83
reorder grid dim schedule ( #2533 )
...
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: d2459878cf ]
2025-07-26 02:46:55 +08:00
Po Yen Chen
e75bbb6a13
Revert "Eliminate warning caused by failed to meet occupancy requirement ( #2389 )" ( #2514 )
...
This reverts commit c5014a86765c598215bf2b58f89c3a7c70f92bac.
[ROCm/composable_kernel commit: 722c22fb15 ]
2025-07-17 10:09:01 +08:00
Qianfeng
337126469c
Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) ( #2487 )
...
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* convert i_nhead_ type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com >
[ROCm/composable_kernel commit: 45904b8fd7 ]
2025-07-11 18:14:47 +08:00
shay-li77
4f08a02dae
support y-direction step length greater than 1 for SimplifiedGenericAttentionMask ( #2338 )
...
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
[ROCm/composable_kernel commit: d814fefe18 ]
2025-07-09 23:18:55 +08:00
Haocong WANG
7c04d93083
[CK TILE] Fix FA build filter ( #2369 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
[ROCm/composable_kernel commit: 5557eadce6 ]
2025-07-08 10:42:07 +08:00
Po Yen Chen
a71dc1245f
Eliminate warning caused by failed to meet occupancy requirement ( #2389 )
...
Co-authored-by: felix <felix.li@amd.com >
[ROCm/composable_kernel commit: b2dea90116 ]
2025-07-08 09:17:25 +08:00
ltqin
ba133fe9b7
ck tile pagedkv prefill ( #2405 )
...
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* compiling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernel with pagedkv
* change names of generated files
* change host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: 9f4c5d7372 ]
2025-07-07 16:16:54 +08:00
Po Yen Chen
b62e551ccb
[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface ( #2403 )
...
* Rename batch_prefill interface
* Add min_seqlen_q parameter in MakeKargs()
[ROCm/composable_kernel commit: 50fad03524 ]
2025-06-25 15:19:21 +08:00
Yi DING
cba904aeff
[CK_TILE] FMHA Support hdim_v as a Multiple of 32 ( #2114 )
...
* 160+192
* Add splitkv d160
* cleanup
* fix
* Add change log
* Fix CHANGELOG
* Use static_cast
* Update ignored instance
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: b8212864cf ]
2025-06-24 01:33:31 +08:00
Max Podkorytov
eb96164495
Update for xformers ( #2372 )
...
* update api
* update kernel api
* clang-format
[ROCm/composable_kernel commit: 0366fb2abc ]
2025-06-22 00:28:30 -07:00
Satyanvesh Dittakavi
a4517b0a9d
Do not use warpSize as compile time constant as it is removed ( #2320 )
...
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin <Illia.Silin@amd.com >
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com >
[ROCm/composable_kernel commit: 4c57157d50 ]
2025-06-17 11:54:30 -07:00
MHYangAMD
7cb7aa8e75
Fix fmha fwd precision issue on MI3XX series ( #2285 )
...
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 to cast P leads to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which rounds toward zero.
For example, with K and V fixed to all 1 and a random Q, the outputs are
expected to be all 1. However, some outputs come out as 0.9995. The
deviation is smaller than the atol of 0.001 (1 - 0.9995 =
0.0005 < 0.001), so ck does not report this as an error.
* Add option to switch rtn/rtz for fmha fwd
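The downward bias of round-toward-zero can be reproduced on the host with a simplified model of fp16 precision. The sketch below quantizes a value to an 11-bit significand (1 implicit plus 10 stored bits) and ignores the fp16 exponent range, subnormals, and special values. The function names are hypothetical, and this is not the conversion code used by the kernel.

```cpp
#include <cassert>
#include <cmath>

// Keep 11 significand bits, discarding the rest (round toward zero, "rtz").
inline double quantize_rtz(double x)
{
    int e = 0;
    const double m = std::frexp(x, &e);               // x = m * 2^e, |m| in [0.5, 1)
    return std::ldexp(std::trunc(m * 2048.0), e - 11); // 2048 = 2^11
}

// Keep 11 significand bits, rounding to the nearest value ("rtn").
// std::nearbyint uses the current rounding mode, round-to-nearest-even by default.
inline double quantize_rtn(double x)
{
    int e = 0;
    const double m = std::frexp(x, &e);
    return std::ldexp(std::nearbyint(m * 2048.0), e - 11);
}

// Sum of quantized probabilities, i.e. the output row when every V element is 1.
inline double sum_with(double (*quantize)(double), const double* p, int n)
{
    double acc = 0.0;
    for(int i = 0; i < n; ++i)
    {
        acc += quantize(p[i]);
    }
    return acc;
}
```

With rtz every positive probability is rounded down, so the errors all have the same sign and accumulate, leaving the sum slightly below 1. With rtn the per-element error is never larger than with rtz and can have either sign.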
[ROCm/composable_kernel commit: 9fcf21a4ec ]
2025-06-10 15:03:23 +08:00
Po Yen Chen
4fc501ba2b
[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask ( #2209 )
...
* Assign block indices reversely if kHasMask=true
* Assign block indices reversely for splitkv kernel
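Reversing the block order amounts to a simple index remap. The sketch below is an illustration only: `remap_block_index` and its parameters are hypothetical names, and the kernel's real index computation is not shown in this log. One plausible motivation, which the commit message does not state, is that with a causal-style mask some tiles do more work than others, so the launch order affects load balance.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: when a mask is in use, map block 0 to the last tile,
// block 1 to the second-to-last tile, and so on. Without a mask the
// hardware-assigned order is kept unchanged.
constexpr std::int32_t remap_block_index(std::int32_t block_id,
                                         std::int32_t num_blocks,
                                         bool has_mask)
{
    return has_mask ? (num_blocks - 1 - block_id) : block_id;
}

// The remap is its own inverse: applying it twice returns the original index.
static_assert(remap_block_index(remap_block_index(3, 8, true), 8, true) == 3,
              "reversal must be an involution");
```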
[ROCm/composable_kernel commit: c42b957d65 ]
2025-05-27 10:58:58 +08:00
Zzz9990
3991de1cdc
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q ( #2221 )
...
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit 5ba7148f7b34b1b438e9806748211c153ee8c433.
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com >
[ROCm/composable_kernel commit: ece38b9d7a ]
2025-05-26 19:17:18 +08:00
Po Yen Chen
2655746971
[CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations ( #2198 )
...
* Write soft-sign in inline asm
* Change tile idx computation
* Add macro to turn off soft-sign asm opt
* Use simple for loop to avoid register spill
* Only do block id transform for masking cases
[ROCm/composable_kernel commit: 791802b381 ]
2025-05-16 15:14:46 +08:00
Po Yen Chen
c3f74a83b3
[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines ( #2163 )
...
* hack for cap logits
* fix bug
* Re-format files
* Allow specifying logits_soft_cap through APIs
* Support turn on/off logits_soft_cap in async pipeline
* Do not generate non-verified kernels
* Align receipt used in Aiter
* Sync logits soft-capping across pipelines
* Re-enable some hdim pipelines
* fix perf
* Add attention variant for logits_soft_cap
* Add newline at end-of-file
* Fix performance
* Add comment to explain logits_soft_cap pre-processing
* Unify code
* Unify floating-point literal style
* Use class data member to silence the compilation error
* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133 )
* Send 'mask' along with variant params to the LogitsMask()
* Send block indices to the variant
* Add indices parameters in variant interface
* Fix fmha bwd codegen error
* Allow switch logits_soft_cap impl
* Eliminate register spills
* Fix compilation errors
* Fix wrong LSE
* Fix LSE for splitkv kernel
* Sync splitkv pipeline changes
* Add batch_prefill kernel/pipeline
* Fix codegen error
* Undo changes in CMakeLists.txt
* Merge pipeline filtering check
* Use different code path if kHasLogitsSoftCap=false
* Remove [[maybe_unused]] attribute
* Use pre-existing compile-time flag to instantiate templates
* Sync pipeline changes
* Update CHANGELOG.md
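For orientation, logits soft-capping is commonly written as `cap * tanh(logit / cap)`, which squashes every logit into the open interval (-cap, cap) while staying close to the identity near zero. The sketch below shows that common form next to a softsign-style alternative. Both functions are assumptions for illustration: this log does not show which formula the kernels implement, and the names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Common tanh-based soft cap (assumed formulation, not taken from the kernel).
inline float soft_cap_tanh(float logit, float cap)
{
    return cap * std::tanh(logit / cap);
}

// Softsign-style alternative (also an assumption): avoids the transcendental
// function and approaches +/-cap more slowly than the tanh form.
inline float soft_cap_softsign(float logit, float cap)
{
    return logit / (1.0f + std::fabs(logit) / cap);
}
```

Both variants are odd functions that map 0 to 0 and never exceed `cap` in magnitude, so they can be swapped behind a compile-time switch without changing the range of the capped logits.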
---------
Co-authored-by: Bernard <bernaliu@amd.com >
Co-authored-by: coderfeli <coderfeli@163.com >
[ROCm/composable_kernel commit: 2920604786 ]
2025-05-13 12:19:25 +08:00
slippedJim
cca9cca699
add fmha fwd splitkv receipt for aiter c++ api ( #2068 )
...
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
[ROCm/composable_kernel commit: 5f885d2b7a ]
2025-04-10 23:21:13 +08:00
rocking
2b657d9a2c
Reduce redundant space in bias tensor ( #2024 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: 8a20b62e91 ]
2025-03-28 21:58:06 +08:00
rocking
8be61cfc9d
Sync the kname with instance name ( #1989 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: b819c217e4 ]
2025-03-20 00:06:45 +08:00
carlushuang
5293517d0a
Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" … ( #1971 )
...
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 )
This reverts commit a222573537b139c8a4f6b870910bdd487935efa8.
* fix codegen problem
* Update config.hpp
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
[ROCm/composable_kernel commit: 3e81279d26 ]
2025-03-13 11:41:39 +08:00
Illia Silin
c62042b022
Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" ( #1969 )
...
This reverts commit 4399ad790293bb76e4cf89f333017cf06a6d3e82.
[ROCm/composable_kernel commit: 8cbcd3e0d0 ]
2025-03-11 10:40:18 -07:00
carlushuang
569e9892a6
[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )
...
* support hdim=192/128 pair
* remove useless print
* update
[ROCm/composable_kernel commit: 7a93b16ff6 ]
2025-03-11 21:07:40 +08:00
Qianfeng
7771db8ecf
Ck tile/complete k prefetch ( #1941 )
...
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce
* Remove last block_sync_lds() in the loop
* Tiny adjustment in qr_ks_vs_async pipeline for better performance
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline
* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline
* Use un-rolled gemm for Gemm-0
* Use k0_loops small tile load/store to replace the big tile load/store for K
* Remove the commented lines in qx_ks_vs_custom_policy.hpp
* Tune the prefetching of V in qr_ks_vs_async pipeline
* Move the code for storing the first v_lds tile somewhat later
* Let BlockDropout reuse LDS with V
* Switch to separate code blocks according to iteration index
* Interleave code blocks for better performance
* Move clear_tile(s_acc) for better interleaving
* Move code interleaving
* Use MakeQDramTileDistribution for q_dram_window
* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop
* Let V reuse the LDS of K
* Use array of tiles to represent Q in vgprs
* Use QLoadOnce == false for qr_ks_vs_async pipeline
* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline
* Define statically indexed array k_lds_windows[] to reduce the use of get_slice_tile()
* Move the definition of v_tiles out from the loop
* Define statically indexed array v_lds_windows[] to reduce the use of get_slice_tile()
* Remove using KLoadOnce in qx_ks_vs_custom_policy
* Remove un-used get_slice_tile() call
* Move the code line of clear_tile(s_acc)
* Tidy up the code
* Re-arrange the codes before the main-loop
* Add comments
* Unify the alignment to be 8 for Q/K/V Lds descriptors
* Tuning to K pre-loading
* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false
* Adjust the pipeline codes
* Use NumPrefetchV to separate from NumVLdsBuffers
* Tune the location of a scheduler barrier code line
* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths
* Adjust the use of kPadSeqLenQ and kPadSeqLenK in the kernel
* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline
* Move the location for store_tile() of first v_tile
* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline
* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>
* Try to fix old bugs in qx_ks_vs_custom_policy
* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler
* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy
* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)
* Tiny Fix (#1888 )
* Ck tile/paged attention workaround (#1894 )
* Correction in GetRangeAlongX()
* Work-around to solve the failures in test_paged_attention_ck in xformers
* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline
* Remove one call of move_tile_window for q_dram_window
* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()
* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline
* Adjust the location of codes for storing the first V tile to LDS
* Tiny fix and add comments
* Change GetSmemKPackK size to improve performance
* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy
* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access
* Adjustment in intermediate iteration codes for tiny performance improvement
* Reduce the number of VLds buffers to 2 for whole_k_prefetch situation
* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue
* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()
* Remove useless AsyncopyV
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used
* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy
* Recover the qr_ks_vs_async pipeline
* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline
* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"
This reverts commit 39b82ca194.
* Tiny fix with regard to whole_k_prefetch pipeline compiling
* Update kPadSeqLenK setting in fmha_fwd_kernel
* Use q_element_func and k_element_func
* Use single q_tile rather than multiple sliced q_tiles
* Refine the code according to the review comments
* Re-format one file
* Mark qr_ks_vs_whole_k_prefetch as QLoadOnce == true
[ROCm/composable_kernel commit: 4f54fa3058 ]
2025-03-07 14:19:51 +08:00
Qianfeng
d12c6417a0
Ck tile/paged attention workaround ( #1894 )
...
* Correction in GetRangeAlongX()
* Work-around to solve the failures in test_paged_attention_ck in xformers
[ROCm/composable_kernel commit: a3757a5f9c ]
2025-02-17 14:29:25 +08:00
Qianfeng
cec8025ec4
Tiny Fix ( #1888 )
...
[ROCm/composable_kernel commit: 4cfb24feb6 ]
2025-02-14 12:44:32 +08:00
Qianfeng
3cc02417a9
Update for fmha_fwd qs_ks_vs pipeline ( #1810 )
...
* Update for fmha_fwd qs_ks_vs pipeline
* Remove __builtin_amdgcn_sched_barrier(0)
* Move the p_compute to p conversion earlier to try to increase vgpr reuse
* Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation
* Re-add __builtin_amdgcn_sched_barrier(0)
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: 3d50f57f43 ]
2025-01-13 12:43:05 +08:00
Max Podkorytov
4c98908e17
mark unused args
...
[ROCm/composable_kernel commit: ad697c78ac ]
2025-01-08 10:09:54 -08:00
Max Podkorytov
d7a2a81051
run clang-format -style=file
...
[ROCm/composable_kernel commit: a2e6ad62e2 ]
2025-01-08 10:09:54 -08:00
Max Podkorytov
e1896982b5
run clang-format==12
...
[ROCm/composable_kernel commit: aa59ecaa22 ]
2025-01-08 10:09:54 -08:00
Max Podkorytov
715635839b
update comment in the policy
...
[ROCm/composable_kernel commit: 82fb3f84fb ]
2025-01-08 10:09:54 -08:00
Max Podkorytov
00c32ecda2
update qsksvs comment
...
[ROCm/composable_kernel commit: 4daa82b451 ]
2025-01-08 10:09:54 -08:00
Max Podkorytov
099e23be84
remove dead code
...
[ROCm/composable_kernel commit: 66c5b715c9 ]
2025-01-08 10:09:54 -08:00
Max Podkorytov
63cc962000
clang-format and remove dead code
...
[ROCm/composable_kernel commit: edb78a4729 ]
2025-01-08 10:09:54 -08:00