Cong Ma
37daef0e83
Support transposed C tile in Aquant ( #2679 )
...
The performance of Aquant has increased after enabling transposed C.
Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
[ROCm/composable_kernel commit: 428090f749 ]
2025-08-28 13:28:09 -07:00
Mateusz Ozga
e4010d5ea1
[CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D ( #2752 )
...
* Init commit
* Quick fix, CI fails
* Remove CDElementWise
* Add CDEELementWise
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: 0758883fa4 ]
2025-08-28 12:45:50 -07:00
asleepzzz
3db57fc348
Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )" ( #2757 )
...
This reverts commit f2e6edde3b .
[ROCm/composable_kernel commit: 038ea82315 ]
2025-08-28 22:50:42 +08:00
linqunAMD
4fab7b0a30
[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel ( #2728 )
...
* fix copy basic build error
* fix other ck tile test build error
[ROCm/composable_kernel commit: 4a49dac7c6 ]
2025-08-28 20:30:30 +08:00
Yi DING
f2e6edde3b
[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )
...
* 16x192
* Use buffer_load_lds for lse/d
* Dispatch & cleanup
* Avoid zeroing dq & fix
* fix
[ROCm/composable_kernel commit: ead4447b20 ]
2025-08-28 18:54:18 +08:00
Linjun-AMD
56395b127c
use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline ( #2711 )
...
* add k_lds padding and iglp to improve dim256 fmha fwd
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp
* clang format
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* use same naming style
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
[ROCm/composable_kernel commit: bf7b458e6e ]
2025-08-28 11:39:39 +08:00
Aviral Goel
8b259a9f20
feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types ( #2691 )
...
* feat(check_err): add a variable to adjust number of incorrect values to print
* feat(host_tensor): add printing capability for fp8 bf8 int8 int4
* fix(gemm_utils): update acceptable data type
* fix(host_tensor): print both 4 bit ints in pk_int4_t
* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp
* feat(host_tensor): add print first n elements functions
[ROCm/composable_kernel commit: f5f795c4d6 ]
2025-08-27 18:17:24 -07:00
Cong Ma
871f68f227
[CK TILE GEMM] Fix a merge conflict ( #2753 )
...
* Fixed a merge conflict in 39b254c1
* Foramt the code
[ROCm/composable_kernel commit: cd53e2e57e ]
2025-08-27 11:08:09 -07:00
Bartłomiej Kocot
4a4df2657a
Fix splitk autodeduce for grouped conv bwd weight ( #2742 )
...
[ROCm/composable_kernel commit: cfe5e448db ]
2025-08-27 12:35:42 +02:00
Cong Ma
39b254c126
[CK TILE] Fix bugs in AQuant preshuffle ( #2700 )
...
* [CK TILE] Fix bugs in AQuant preshuffle
- Make Aquant works with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant works with warp 16x16x32 and 32x32x16.
* [CK TILE] Rename Preshuffle to PreshuffleQuant
The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.
* [CK TILE Block Scale] Use GemmConfig to save tile properties
- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
coded tile properties in `gemm_calc_aquant()`
* [CK TILE Block Scale] Rename GemmConfig used in block scale
- Remove unused GemmConfig
- Rename GemmConfig used in block scale
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com >
[ROCm/composable_kernel commit: 245467f359 ]
2025-08-27 00:05:54 -07:00
linqunAMD
fb6848095b
Fix merge mfma_wmma (part 1) regression ( #2749 )
...
root cause: a typo in GetGfx11InputBlkIdx, const ia added by mistake.
[ROCm/composable_kernel commit: 95e4a4efcb ]
2025-08-26 22:49:34 -07:00
John Afaganis
cef79d5f82
Revert "[CK-TILE] Default epilogue, adding support for D ( #2629 )" ( #2746 )
...
This reverts commit 92037686ae .
[ROCm/composable_kernel commit: 508e7912f9 ]
2025-08-26 09:48:49 -07:00
Mateusz Ozga
92037686ae
[CK-TILE] Default epilogue, adding support for D ( #2629 )
...
* Extend 2d-epilogue, D support
* Added tests & update
* Remove unused attribute
* Extend tests
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: d43228fbca ]
2025-08-25 19:29:35 -07:00
Yi DING
1f35e4c5b7
[CK_TILE] FMHA avoid unnecessary vmcnt0 ( #2715 )
...
* FMHA avoid unnecessary vmcnt0
Squashed commit of the following:
commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 22 03:15:51 2025 +0000
merge develop and solve conflicts
commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c
Merge: 2dac61a4f 5d56dde0e
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 22 03:15:21 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue
commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140
Author: Ding, Yi <yi.ding@amd.com >
Date: Tue Aug 19 02:17:43 2025 +0000
update bwd
commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682
Author: Kevin Choi <kevin.choi@amd.com >
Date: Mon Aug 18 19:36:38 2025 +0000
add restrict to applicable functions
commit 45534dee5bcbe532da46fc5cd6601cde10d84387
Author: Ding, Yi <yi.ding@amd.com >
Date: Mon Aug 18 02:07:03 2025 +0000
bwd filter
commit 7abd7b372b82cba94a457238b6b4a81d093e7280
Author: Kevin Choi <kevin.choi@amd.com >
Date: Sat Aug 16 08:15:23 2025 +0000
remove noinline attr as it causes a lot more s_waitcnt's
commit 89c29746a09255c1d26038171157e91d1b68d14a
Author: Kevin Choi <kevin.choi@amd.com >
Date: Thu Aug 14 12:11:17 2025 +0000
remove innerloop, move restrict parameters to mainloop and add noinline attribute.
commit 6f61b3a5c80011411aa3aebf7983602f7c117566
Author: Kevin Choi <kevin.choi@amd.com >
Date: Thu Aug 14 07:06:51 2025 +0000
Create inner lambda with restrict parameters, add restrict to some parameters
commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a
Author: aska-0096 <haocwang@amd.com >
Date: Thu Aug 14 03:43:54 2025 +0000
save for debug
commit 5f2c3cfa86c6951208a1cc227fa556704a885a88
Merge: 25f067b4f 165a2723c
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 13 02:15:22 2025 +0000
Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 25f067b4f09d6909a05e252c7621124046dfda57
Merge: 447c1c5d6 bd3b4afb9
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 13 02:14:26 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 165a2723c557420b48891cc1ce3434e3675aef5d
Merge: 447c1c5d6 be39c00a8
Author: asleepzzz <hanwen.chang@amd.com >
Date: Wed Aug 13 00:34:11 2025 +0800
Merge branch 'develop' into wip-async-tr-fa
commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 14:25:50 2025 +0000
refactor blockgemm change, isolate to v2;
commit 8f67083511ff77d31c880f4427d3bdf53a179568
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 09:26:13 2025 +0000
clang format
commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965
Merge: c84662c34 19ef22e56
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 09:04:41 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit c84662c345755ec5f3d524fdde4aa951c8f86298
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 08:46:06 2025 +0000
Fix the bug
commit e0647ffa5646f8132529b152af02750c4010013d
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 04:02:41 2025 +0000
fix conflict. disable all v-col instance for fmha fwd
commit 781f98236c376f57591a6d481cc2ee04b36a148b
Merge: 241f3d7dc 8cb8da53c
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 03:52:34 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 01:53:31 2025 +0000
clang format
commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f
Merge: 1a629c098 eda4a5e80
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 12 01:52:52 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 1a629c09876cc05f0750db7eade1d527dc32a1d3
Merge: f65874e5b 92c0435e2
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 15:59:40 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit f65874e5b07579d5b734b4c68877679a3ee04dac
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 15:37:37 2025 +0000
change the warp setting for hdim32 fmha fwd
commit 7c5f5e65e97486c074ef9a138900ed9aafea547e
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 11 14:21:09 2025 +0000
tempsave, update the blocksync functions
commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915
Author: aska-0096 <haocwang@amd.com >
Date: Sun Aug 10 06:00:51 2025 +0000
fix bug in pki4
commit 073db2e18af21f1ed1fb3d1f1c15830838df986f
Author: aska-0096 <haocwang@amd.com >
Date: Sat Aug 9 03:25:12 2025 +0000
fix bugs in gemm
commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 18:35:53 2025 +0000
fix bug on non-gfx950
commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 17:53:19 2025 +0000
fix bug
commit 30de97f473685e0bd5b82f15eee2493d9a05cffd
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 15:42:15 2025 +0000
fix bugs
commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:31:01 2025 +0000
fix clangformat with 18.1.3
commit e4cb185c41586d018771a5413efd909d8d53a8c5
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:07:40 2025 +0000
remove non-necessary change
commit 498f0d44cfba17287cce8d10855cce5c5de263db
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 09:04:02 2025 +0000
bug fix, clang format;
commit 3cb648cbc4883e6889340d85f48d803a21b9c805
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 08:08:03 2025 +0000
Remove unnecessary changes
commit 9e7ff3b611b7933b65973907a0cae312a15d31c6
Merge: a3c1bfe6d f4247d199
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 07:50:12 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 06:19:31 2025 +0000
remove unnecessary files; rename some files
commit 6c257fa27729c005d539b5b71deeba3703031089
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 8 05:46:18 2025 +0000
merge fa_decode pipeline into fmha_fwd api
commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449
Author: aska-0096 <haocwang@amd.com >
Date: Wed Aug 6 05:58:43 2025 +0000
add __restrict__ to tr load
commit bbad2b979b701533b74f43452ffe0f775e019139
Author: aska-0096 <haocwang@amd.com >
Date: Tue Aug 5 07:23:51 2025 +0000
Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
commit d7fabd5f765e2a573ddbaf0857ce6f691407e562
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 10:27:42 2025 +0000
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 10:02:17 2025 +0000
add vmcnt guard before load ktile
commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c
Author: aska-0096 <haocwang@amd.com >
Date: Mon Aug 4 06:49:01 2025 +0000
Load Q through lds, implement xor;
commit 62bb9f05177dfb8280d6c2be67a88492d6be4838
Author: aska-0096 <haocwang@amd.com >
Date: Fri Aug 1 10:44:54 2025 +0000
small refactor
commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 31 10:25:37 2025 +0000
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 31 05:13:27 2025 +0000
enable larger tile size; upgrade xor pattern
commit a468e59a01d6dd85c105ca30ac249491256c5915
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 30 12:25:33 2025 +0000
remove all lds bankconflict with xor layouts
commit 39ff55cdc377311112100fb24bc013adfd8960c0
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 30 03:51:06 2025 +0000
enable prefill overload operator().
commit a7b152a788e8035c93f8e4cbf317863182665d8f
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 25 07:10:01 2025 +0000
fix the lds alignment caused performance regression
commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 23 09:05:57 2025 +0000
remove unnecessary features
commit 9758750801c7fd5a80f654eb982f43b87d674fa3
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 22 08:04:05 2025 +0000
tempsave. asynccopy+trload sanity checked
commit 1c4c04d725047357224ebf8a2b94d9010a5651a6
Author: aska-0096 <haocwang@amd.com >
Date: Mon Jul 21 05:55:55 2025 +0000
tempsave, trload+asyncload done
commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 10:04:34 2025 +0000
compile pass
commit d41b5eace939909084d32281710fb81142ad5fec
Merge: 3f86a81ee 8c3766f0d
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 05:17:27 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 3f86a81eee75256a78df02032d50814aaa42b038
Author: aska-0096 <haocwang@amd.com >
Date: Fri Jul 18 05:16:39 2025 +0000
tempsave
commit 7d43f7446a9a20773f70e08462393f6c9afb7280
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 10:06:09 2025 +0000
temp save
commit 727629cd9115f1be9c1800bb65a8ea84ff06c250
Merge: aa5da19c9 94bceebc9
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 07:24:32 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
commit 94bceebc96ef4885e0ac861b7793e2e2897481bd
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 03:10:46 2025 +0000
move test_copy into test
commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:41:31 2025 +0000
remove unnecessary output
commit b1dbcacb1832560c6cc967a079dffce558228f0b
Merge: 5b0d311e6 0eaf3325a
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:26:13 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit 5b0d311e649257557a7014c28fcfac0c327b77b5
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 17 02:26:10 2025 +0000
add input validation and bug fix
commit 0eaf3325a8e019402ff12a2402f446f8471f584f
Merge: a66e1d29a f77d70498
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Date: Wed Jul 16 11:23:57 2025 -0700
Merge branch 'develop' into test_copy_fix
commit a66e1d29a8cccc17cc8958d970ec7b1281ec8291
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 08:55:50 2025 +0000
fix vmcnt shift
commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 08:37:07 2025 +0000
Improve s_waitcnt_imm calculation
commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 05:39:50 2025 +0000
fix the s_waitcnt_imm calculation
commit 1c0870089a0e7c78ed71a278bf52d98fc780e482
Merge: d6ee05e36 92ada43ba
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 03:57:57 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jul 16 03:54:33 2025 +0000
Add block_sync_lds_direct_load utility
commit c037a72040217471f52ee76bed9c07bf5b22aef4
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 15 09:39:03 2025 +0000
fix async copytest bug
commit aa5da19c94022449b027e7a57668f2e219f0f171
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jul 10 04:29:33 2025 +0000
temp save, change all instance to 1wave
commit ddd172feb9eb2cb783420a8db6f44d51b350c370
Author: aska-0096 <haocwang@amd.com >
Date: Tue Jul 8 08:37:20 2025 +0000
tempsave, fmha_decode
commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33
Author: aska-0096 <haocwang@amd.com >
Date: Sat Jun 21 15:02:57 2025 +0000
temp save, waiting for debug
commit 71dd31f15bca01995c8cb0be9e903103f4657181
Author: aska-0096 <haocwang@amd.com >
Date: Thu Jun 19 05:11:52 2025 +0000
save an example for __bf16 type
commit cdf33e079fa7d7d5b03b06550df2356b02041d7b
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jun 18 07:27:24 2025 +0000
fix bwd code
commit d630998dc6751f44097b1e9a239bb5063a793736
Author: aska-0096 <haocwang@amd.com >
Date: Wed Jun 18 06:37:16 2025 +0000
Fix for fwd/bwd kernel build filter
commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com >
Date: Tue Aug 19 08:13:18 2025 +0000
Add restrict to avoid unnecessary vmcnt
---------
Co-authored-by: aska-0096 <haocwang@amd.com >
* Add comments for c-stype cast
* Better comments
---------
Co-authored-by: aska-0096 <haocwang@amd.com >
[ROCm/composable_kernel commit: de61e55493 ]
2025-08-25 20:55:12 +08:00
John Shumway
1c8519a03b
Remove unsupported use of c++20 concept. ( #2719 )
...
Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.
[ROCm/composable_kernel commit: c71d7ddd74 ]
2025-08-24 21:29:23 -07:00
linqunAMD
4d98c232fc
Extend XDL kernel to Support RDNA3/4 - Part 1 ( #2606 )
...
[ROCm/composable_kernel commit: d6e49c5fde ]
2025-08-22 17:46:30 -04:00
Po Yen Chen
5d56dde0e0
[CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values ( #2669 )
...
* Allow return VGPR get_warp_id() value
* Avoid using SALU in async_load_raw()
[ROCm/composable_kernel commit: 0db21053e6 ]
2025-08-22 10:17:05 +08:00
Po Yen Chen
b87e256c42
[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() ( #2549 )
...
* Support async load dwordx4
* Enlarge load size on gfx950
[ROCm/composable_kernel commit: 4a7ecce096 ]
2025-08-22 10:13:47 +08:00
Yi DING
f885dd8b33
[CK_TILE] FMHA BWD Fix Compilation with Bias ( #2682 )
...
* [CK_TILE] FMHA BWD Fix Compilation with Bias
* Fix appendkv kApplyRoPE
[ROCm/composable_kernel commit: 4cfa2c7158 ]
2025-08-22 10:01:10 +08:00
Bartłomiej Kocot
092ba92f26
[CK Tile] Grouped convolution backward data ( #2652 )
...
* base working version for single groupped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com >
[ROCm/composable_kernel commit: 4212bbc170 ]
2025-08-20 05:29:57 -07:00
Emily Martins
aa964be163
[CK Tile] Stream K GEMM Kernel HostArgs and Kernel Classes ( #2681 )
...
* CK Tile Stream K Device Ops
Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The
StreamKKernel class injects Universal Gemm and includes functions to
facilitate kernel preparation for the GPU.
* Stream K Device Ops Fixes
- Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to
ensure we get size needed for accumulation buffers and semaphores.
- Pass in num_sk_blocks into TilePartitioner constructor
- Update documentation
* Add WarpTile dimensions to GetName function in StreamKKernel class
* Fix typos in StreamKHostArgs class description.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com >
* Apply clang format on updated comment for StreamKHostArgs
* Explicitly specify type for StreamKReductionStrategy enum
* Remove unecessary scopes
* Unify the commenting style to inline comments
* Add explicit casts for occupancy and num_cu in MakeKernelArgs function
Both the static functions Occupancy and NumCU in the StreamKKernel class
use functions from the HIP API that result in the returned occupancy and
num_cu types being type int. The TilePartitioner interface for stream K will
have occupancy and num_cu being type ck_tile::index_t which is int32_t.
Thus, to be safe, this change ensures that both occupancy and num_cu are
cast to int32_t.
* Fix use of kentry due to interface update
PR #2594 updated the interface for the kentry function in
include/ck_tile/host/kernel_launch.hpp. As a result, the static function
Occupancy was updated to work correctly with the new interface.
PR #2594 also changed UniversalGemmKernel's KernelBlockSize static
variable to kBlockSize, so the StreamKKernel class was updated to
reflect this change.
* Switch type of num_sk_blocks from uint32_t to int32_t
This change switches the type of num_sk_blocks to type ck_tile::index_t
which is int32_t. This was done because parallel work for the CK Tile
StreamK TilePartitioner's constructor will have num_sk_blocks as
ck_tile::index_t. Thus, this change will help unify the interfaces to
avoid any type conversion errors.
---------
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com >
[ROCm/composable_kernel commit: 071165919f ]
2025-08-19 15:08:52 -06:00
jefyang1
8b37a2968f
Fix pk i4 v3 example test regression on gfx942 ( #2706 )
...
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
[ROCm/composable_kernel commit: 6ba9289b26 ]
2025-08-19 09:58:28 -07:00
joyeamd
f9e76244d8
fix grouped gemm example when wave32 enabled ( #2707 )
...
1, delete some unused variables
2, fix BlockSize when wave32 enabled
[ROCm/composable_kernel commit: a1589a9667 ]
2025-08-19 16:20:43 +08:00
mirchen-amd
020c9139ec
Mirchen/gemm blockscale wp segfault fix ( #2638 )
...
* Add stride validation to prevent segfault in blockscale GEMM
* run clang-format
* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
* added stride length checking to more gemm examples in ckprofiler
* ran clang format
* added validation header and implement in core gemm operations
* remove ck_tile transpose and gemm stages from CI (#2646 )
* update CK build instruction step 4 (#2563 )
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
* Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 )
* fix reduce2d
- revret the combine_partial_results() chnages
- remove auto from function def
* clang-format
* enable aiter test_mha in daily CI (#2659 )
* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582 )
* feat(copy_kernel): add basic copy kernel example with documentation
* docs(CHANGELOG): Updated changelog
* chore: performed clang format
* Update example/ck_tile/39_copy/copy_basic.cpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* fix(terminology): follow amd terms
* extract elementwise copy to a new kernel
* fix(copy_kernel): bug in verification
* add comments about vgpr usage
* lint and nits
* add notes and comments
* print hostTensor via stream
* print hostTensor via stream
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628 )
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
* Fix num_byte calculations to use nhead_k for K & V size (#2653 )
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
* [CK_TILE] FMHA BWD Decode Pipeline (#2643 )
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
* fix (#2668 )
* Optimize fmha fwd decode & prefill for gfx950 (#2641 )
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 )
This reverts commit 747d127983 .
* added batch stride checking to batched gemm ops in profiler
* removed batch stride validation
* removed batched stride validation again
* Update include/ck/library/utility/profiler_validation_common.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
* refactor function names
* added gemm stride checking to more profiler gemm operations
* run clang format
* add stride checkign to 01 gemm example
* rename from profiler to validation common, used for examples and profiler
* build of ckProfiler success
* update file headers
---------
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com >
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
Co-authored-by: Yi DING <yi.ding@amd.com >
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com >
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com >
Co-authored-by: Haocong WANG <haocwang@amd.com >
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 60320e90c1 ]
2025-08-19 01:19:17 -07:00
linqunAMD
615ca9842d
Support Wave32 in CK_TILE - Part 1 ( #2594 )
...
* Support wave32/wave64 in CK_TILE - Part 1
* remove blocksize in kernel launch
* fix build error
* fix clang format
* fix clang format 2
* fix clang format 3
* fix fmha build error
* fix fmha build 2
* fix fmha build 3
* fix build error 4
* address review comment
* update change log
* replace KernelBlockSize with kBlockSize
* fix CI fail
* fix clang format
* address review comment and rebase code.
* fix universal test fail
---------
Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com >
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: 9fcc1ee9fd ]
2025-08-18 10:08:31 -07:00
Sami Remes
13bfcba04c
Add other layouts for FP8 block scaled gemm ( #2665 )
...
* Start adding other layouts for gemm_ab_scale
* Add some instances
* Create tensor descriptors for A/B scales depending on A/B layout
* Fix formatting
* Revert some comments
* Revert commented instances in CMakeLists.txt
* Add some more instances for col-row gemm
* enable more row,row instances
* Use occupancy=1 for col,row layout to avoid spills
[ROCm/composable_kernel commit: 26d3300930 ]
2025-08-18 01:46:10 -07:00
Tianyuan Wu
a4d70b6e13
Fix CI build error ( #2695 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
[ROCm/composable_kernel commit: 7310830d14 ]
2025-08-18 01:45:40 -07:00
Tianyuan Wu
abb90422b4
[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 ( #2466 )
...
* WMMA GEMM F16 Implementation
Signed-off-by: root <tianyuwu@amd.com >
* Self-review
Signed-off-by: root <tianyuwu@amd.com >
* ASIC check minor tweak
Signed-off-by: root <tianyuwu@amd.com >
* add missing include file
* Set GPU_TARGETS to gfx11/12 generic
Signed-off-by: root <tianyuwu@amd.com >
* INT8 GFX12
Signed-off-by: root <tianyuwu@amd.com >
* add int8x16 branch
* Fix CI script
Signed-off-by: root <tianyuwu@amd.com >
* Fix typo
Signed-off-by: root <tianyuwu@amd.com >
* Add CK_Tile WMMA example
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Fix CI
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* fix clang format
* Set M/N_Warp Back to Constant
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Use GemmConfigComputeV3 by default
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Remove CK_Tile wmma gemm examples from the CI list
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add atomic add fallback method for gfx11
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Omit copyright year
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Support non-square cases
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add get_device_ip()
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Add atomic add fallback method for gfx11"
This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746.
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"
This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5.
* Revise method name and typos
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Try fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Try fix CI"
This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902.
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo caused by merge
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Fix typo caused by merging
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
---------
Signed-off-by: root <tianyuwu@amd.com >
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 68134b60e4 ]
2025-08-15 16:22:27 -07:00
jefyang1
1a412578d8
Add gemm universal f8 f8 bf16 instances on gfx950 ( #2662 )
...
[ROCm/composable_kernel commit: d7c95dd491 ]
2025-08-14 13:25:24 -07:00
Yashvardhan Agarwal
b13aa4217e
CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) ( #2632 )
...
* CK_TILE: Implement two-stage split-K GEMM with reduction
- Added split-K GEMM with reduction example
* comment resolutions
[ROCm/composable_kernel commit: 7f14772406 ]
2025-08-14 10:18:52 +02:00
Gino Lu
3bbad88a97
fix wrong nan producion. ( #2640 )
...
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
[ROCm/composable_kernel commit: e5623d3825 ]
2025-08-14 15:12:31 +08:00
Aviral Goel
e371bb31f5
Minor Improvements in CK TILE memory copy EXAMPLE ( #2678 )
...
* Rename vector to ThreadTile
* more notes on tile encoding
* remove number<> from tuple of make_tile_window
* add script to stress test the copy example
[ROCm/composable_kernel commit: 8a698c7445 ]
2025-08-13 15:24:16 -07:00
joyeamd
d6347da784
[CK_TILE]fix elementwise example in gfx11/12 ( #2676 )
...
* fix elementwise examples
* improve the robust
* fix ck_tile's elementwise test
* update elementwise test
[ROCm/composable_kernel commit: bcc38deff7 ]
2025-08-13 15:21:46 -07:00
Enrico Degregori
ec1c249b63
Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) ( #2675 )
...
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
[ROCm/composable_kernel commit: a6f4029276 ]
2025-08-14 00:21:09 +02:00
SamiAario-AMD
8a32077ccd
Cleanups ( #2631 )
...
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
[ROCm/composable_kernel commit: 28a97865f5 ]
2025-08-13 10:12:08 +02:00
Haocong WANG
e3e8de3477
Re-enable optimization for gfx950 fmha fwd ( #2671 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
* refactor blockgemm change, isolate to v2;
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 05a6e92705 ]
2025-08-13 14:57:43 +08:00
Cong Ma
402b3f445b
Preshuffle AQ matrix in block scale gemm ( #2624 )
...
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com >
[ROCm/composable_kernel commit: 452791a3ba ]
2025-08-12 21:32:51 -07:00
Thrupti Raj Lakshmana Gowda
62fb072dbe
GEMM Multi D for CK Tile Engine ( #2660 )
...
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com >
[ROCm/composable_kernel commit: 3f57ec3d2d ]
2025-08-12 16:05:05 -07:00
joyeamd
60e654e2c4
[CK_TILE]fix ck_tile's moe_sorting example in gfx11 ( #2667 )
...
* fix ck_tile's moe_sorting example in gfx11
* fix clang format
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 0856b3f4a2 ]
2025-08-12 12:33:56 -07:00
asleepzzz
4730050182
Revert "Optimize fmha fwd decode & prefill for gfx950 ( #2641 )" ( #2670 )
...
This reverts commit 747d127983 .
[ROCm/composable_kernel commit: 5b39de4bb6 ]
2025-08-12 20:27:10 +08:00
Haocong WANG
747d127983
Optimize fmha fwd decode & prefill for gfx950 ( #2641 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
[ROCm/composable_kernel commit: b7322a521a ]
2025-08-12 19:43:14 +08:00
Yi DING
19ef22e567
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
[ROCm/composable_kernel commit: 8e1eb0c1ee ]
2025-08-12 17:02:52 +08:00
Yi DING
8cb8da53c9
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
[ROCm/composable_kernel commit: 4fde1646e5 ]
2025-08-12 11:11:55 +08:00
Yashvardhan Agarwal
92c0435e29
Fixes to "General 2D Reduction Kernel" ( #2535 ) ( #2656 )
...
* fix reduce2d
- revret the combine_partial_results() chnages
- remove auto from function def
* clang-format
[ROCm/composable_kernel commit: 191c62967b ]
2025-08-11 15:01:33 +02:00
Max Podkorytov
f4247d1996
[CK-tile] add more tests for batched transpose testing the rectangular block tile sizes ( #2634 )
...
* add failing tests
* swap out and reference
* add constraint assert to transpose input distribution
* test both pipelines with rectangular block tile
* print mismatched indices
* add a smaller failing test for old pipeline
* print grid and block
* fill output before operating on it
* swap m/n tile sizes and make one test pass
* add device syncs
* add one more flipped test case
* flip block tile at host arg init
* fix tiles for lds pipeline
* clang-format
* rename tests
* roll back error check
* remove device syncs
* reduce large test case's size
[ROCm/composable_kernel commit: ab26026835 ]
2025-08-07 16:51:53 -07:00
Gino Lu
ed2293a87b
Add e8m0 scaled convert into CK_TILE ( #2617 )
...
* first commit
* remove redundent code
* modify according to comments.
* fix type_convert error with scaled_type_convert
[ROCm/composable_kernel commit: 5d6d236b25 ]
2025-08-07 21:37:28 +08:00
Yi DING
6fdfcdb180
[CK_TILE] FMHA BWD Remove Unnecessary Padding ( #2550 )
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
[ROCm/composable_kernel commit: b0a97498b0 ]
2025-08-07 21:24:43 +08:00
Sami Remes
1777ce3229
[CK_TILE] Enable printing more structures in CK-Tile ( #2443 )
...
* Add more printing to core cktile
* Revert other changes in static encoding pattern
* Refactor to using a free print() function
* Remove loops and print just the containers
* Print tuple with better formatting, fix sequence compilation
* Add some tests for print utility
* Add print utility header
* Print for static_encoding_pattern
* add buffer_view printing
* Align vector_traits
* Fix formatting
* Lower-case enum strings
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com >
* Remove empty comment lines
* Fix test with lower-case too
* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y
* Add test_print_common.hpp
* add print.hpp in core.hpp
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com >
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
[ROCm/composable_kernel commit: ffdee5e774 ]
2025-08-07 15:45:27 +03:00
Enrico Degregori
bdafbd7ca1
Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) ( #2610 )" ( #2637 )
...
This reverts commit 863c87b1e2 .
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
[ROCm/composable_kernel commit: 21e9983913 ]
2025-08-07 12:30:08 +02:00
Bartłomiej Kocot
14eef94b28
Grouped Convolution Forward Infer Bias Bnorm Activ ( #2621 )
...
* Grouped Convolution Forward Infer Bias Bnorm Activ
* 3d
[ROCm/composable_kernel commit: 5328b232b2 ]
2025-08-07 08:36:47 +02:00