Qianfeng Zhang
ddf0f1c8ed
Update NumPrefetchK and NumPrefetchV in the softmax pipeline on MI300 to achieve better interleaving
2025-12-25 14:30:57 +00:00
Qianfeng Zhang
02cae85af5
Load Q directly from global memory to registers for BlockGemm
2025-12-20 14:08:55 +00:00
Qianfeng Zhang
3d90b5f90e
Remove unused includes from default policy file
2025-12-19 10:13:41 +00:00
Qianfeng Zhang
9e47664092
Move common code from Problem class scope to detail namespace
2025-12-17 10:37:21 +00:00
Qianfeng Zhang
89daa890d1
Remove unnecessary call to __builtin_amdgcn_s_waitcnt(0xc07f)
2025-12-17 07:47:17 +00:00
Qianfeng Zhang
1cf868026b
Add support of loading QK tiles of hdim96 without padding to hdim128
2025-12-16 16:39:40 +00:00
Qianfeng Zhang
588f573ee1
Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile
2025-12-16 15:03:57 +00:00
Qianfeng Zhang
179f0e857e
Rename WarpTile in fwd setting
2025-12-14 16:40:52 +00:00
Qianfeng Zhang
125934a966
Simplify the code defining the KDram and QDram tile distributions
2025-12-14 14:23:56 +00:00
Qianfeng Zhang
1ab5e9da93
Tiny update in GetMaxVectorSize()
2025-12-14 04:43:02 +00:00
Qianfeng Zhang
f79a29ac80
Rename and add scripts for testing hdim96
2025-12-12 16:16:43 +00:00
Qianfeng Zhang
b3d54477f1
Enable hdim96 instances
2025-12-12 16:16:23 +00:00
Qianfeng Zhang
18108d0d54
Fix the stride definition in MakeKLdsBlockDescriptor()
2025-12-12 09:55:53 +00:00
Qianfeng Zhang
db39b44bab
Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV
2025-12-11 10:47:54 +00:00
Qianfeng Zhang
8640ffe8eb
Further correction with regard to using n0_loops and k1_loops
2025-12-08 16:03:56 +00:00
Qianfeng Zhang
641dae10e8
Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning
2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857
Simplify the code in block_gemm
2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519
Further clarify the use of kSubQKHeaddim and kQKHeaddim
2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee
Clarify the use of kSubQKHeaddim and kQKHeaddim
2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a
Simplify the code handling k_lds_wite_windows and k_lds_read_windows in the pipelines
2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be
Tiny fix in GetQKBlockGemm
2025-11-30 14:04:48 +00:00
Qianfeng Zhang
f01e0ef37d
Enable the use of WarpTile-32x32x16 and add verification scripts
2025-11-30 04:58:28 +00:00
Qianfeng Zhang
d99493606e
Add static_assert and comments in the with_softmax pipelines
2025-11-28 15:19:33 +00:00
Qianfeng Zhang
f952d3571c
Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950
2025-11-28 14:02:16 +00:00
Qianfeng Zhang
a0e4315d4e
Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline
2025-11-27 15:30:53 +00:00
Qianfeng Zhang
69c97c06d7
Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md
2025-11-27 08:27:52 +00:00
Qianfeng Zhang
f9e8c5539f
Use explicit partition_index to ensure warp_id is allocated in a VGPR when accessing LDS tile_window
2025-11-23 04:49:01 +00:00
Qianfeng Zhang
4f33eb5857
Merge branch 'develop' into hstu_attention_mi350_fwd_bwd
2025-11-23 04:20:53 +00:00
Emily Martins
02ab76c2cb
Fix CK Tile DP + 2 Tile Stream-K Validation Errors ( #3269 )
...
When multiple workgroups contribute to a tile through atomics, there may
be round-off error in cases where the accumulator type is not the same as
the C type. To compute an error tolerance for
test validation, the Stream-K Tile Partitioner has a function called
estimate_num_wgs_per_tile to estimate the number of workgroups per tile.
That said, this function only provides an estimate. In some cases for
DP+2TSK, the function returns 1 rather than the more accurate value of
2.
Thus, this change updates the estimate_num_wgs_per_tile function to
explicitly return the value of 2 in cases for DP+2TSK to ensure that we
have a better error tolerance to avoid test failures due to round-off
error.
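A minimal sketch of the idea, not the actual CK Tile code: the helper names `wgs_per_tile_for_tolerance` and `scaled_tolerance`, the boolean flag, and the linear scaling of the tolerance are all assumptions made for illustration. Only the "return 2 for DP+2TSK" behavior comes from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of workgroups assumed to contribute to one C tile
// when computing a validation tolerance. raw_estimate stands in for whatever the
// partitioner's estimator would otherwise return.
inline std::uint32_t wgs_per_tile_for_tolerance(std::uint32_t raw_estimate,
                                                bool is_dp_2tile_streamk)
{
    // For DP + 2-Tile Stream-K, pin the value to 2 instead of trusting an
    // estimate that may come back as 1.
    if(is_dp_2tile_streamk)
    {
        return 2u;
    }
    // Otherwise keep the estimate, but never go below one contributor.
    return std::max<std::uint32_t>(raw_estimate, 1u);
}

// Hypothetical tolerance model (an assumption): each contributing workgroup adds
// one rounding step from the accumulator type to the C type, so the base
// tolerance is scaled linearly by the number of contributors.
inline double scaled_tolerance(double base_tolerance, std::uint32_t num_wgs)
{
    assert(base_tolerance >= 0.0);
    return base_tolerance * static_cast<double>(num_wgs);
}
```

With this model, a DP+2TSK run whose raw estimate is 1 is still validated against twice the single-workgroup tolerance.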
2025-11-21 20:29:47 -07:00
Illia Silin
21ae743acd
Enable daily builds on gfx1010 ( #3258 )
...
* add build/test on gfx1010
* only build and run on gfx1010 once daily
2025-11-21 07:22:01 -08:00
John Shumway
ea6e4fcbbc
Fix builder errors. ( #3260 )
...
There were four errors to fix:
1. The checks for defaulted direction were not implemented in the predicate concept.
2. Had to delete an obsolete and undefined operation enum.
3. A factory was passing a boolean in place of an integer.
4. Some of the factory tests are not compiling correctly when linking in the full source (with CK_EXPERIMENTAL_BUILDER=ON), so I commented them out.
2025-11-21 15:25:45 +01:00
John Shumway
f38c3de9f9
Fix copyright messages in experimental/builder. ( #3253 )
...
Our copyright messages were mostly correct, but we inconsistently used (C) instead of (c) like the rest of the CK code. This PR fixes that (using lowercase c) and adds a missing copyright header to one file.
2025-11-20 17:40:55 -08:00
Aviral Goel
c8563f2101
chore(copyright): update copyright header for test directory ( #3252 )
...
* chore(copyright): update copyright header for test directory
* chore(copyright): update copyright header for test directory
* chore(copyright): update copyright header for client_example directory
* chore(copyright): update copyright header for test directory
2025-11-20 20:36:57 -05:00
Aviral Goel
a960c9950b
chore(copyright): update copyright header for cmake directory ( #3254 )
2025-11-20 20:36:37 -05:00
lalala-sh
f58bd56e6b
fix static assert ( #3178 )
...
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-11-20 17:27:05 -08:00
yinglu
4155eb24f9
fix:bf16x3:enable all instances on gfx950 ( #3248 )
...
* fix:bf16x3:enable all instances on gfx950
* fix clang-format fail
* fix clang-format fail
* fix:modified wrong params previously
2025-11-20 17:09:43 -08:00
spolifroni-amd
938b8ed3bf
Spolifroni amd/update changelog 711 ( #3211 )
...
* Update CHANGELOG.md with 7.1.1 information
* Update CHANGELOG.md
2025-11-20 10:51:18 -08:00
Yi DING
8b284a63a4
[CK_TILE] Refine FP32 => FP16/BF16 Conversion ( #3215 )
...
* [CK_TILE] Refine FP32 => FP16/BF16 Conversion
* Thank you Copilot
* Rename fix
* Fix example
* Fix accu checking
* Fix
* Fix
2025-11-20 10:50:26 -08:00
Gavin Zhao
07314ac543
Add support for RDNA1 GPUs ( #3220 )
...
* Allow compilation for RDNA1 (__gfx101__)
Signed-off-by: Gavin Zhao <git@gzgz.dev >
* More RDNA1 changes
Signed-off-by: Gavin Zhao <git@gzgz.dev >
* Even more RDNA1 changes
Signed-off-by: Gavin Zhao <git@gzgz.dev >
* cmake: skip build quantization for unsupported arches
* add gfx10-1-generic support as well
* add gfx1013 and complete gfx10-1-generic
* fix clang format
* enable DL kernels on gfx101x
---------
Signed-off-by: Gavin Zhao <git@gzgz.dev >
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-11-20 10:45:57 -08:00
Robin Voetter
bb155ef678
ck-builder: add remaining ck factory tests ( #3223 )
...
Now that the remaining reflection has been implemented, we
can add the remaining factory tests too. This is the complete set
of instances for forward grouped conv currently in CK.
2025-11-20 10:42:36 -08:00
Robin Voetter
245c6011cf
ck-builder: group transfer operations per tensor ( #3217 )
...
Grouping transfer operations per tensor makes it easier to
constrain on and operate with the transfer operations. As an
example, we can now deduplicate the logic for translating
the transfer operations from the ck-builder interface to the old
ck interface for the A and B tensors.
2025-11-20 10:40:48 -08:00
Aviral Goel
fb43760c66
chore(copyright): update copyright header for library directory ( #3239 )
2025-11-20 10:36:05 -08:00
Aviral Goel
7dfc46d73d
chore(copyright): update copyright header for test directory ( #3243 )
...
* chore(copyright): update copyright header for test directory
* chore(copyright): update copyright header for test directory
2025-11-20 10:33:34 -08:00
Emily Martins
2e4b8a8fc4
[CK_TILE] Remove Old CK Tile Stream-K Artifacts ( #3202 )
...
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed
Revert "Add attn sink ( #2892 )" ( #3250 )
...
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab
Add attn sink ( #2892 )
...
* enable attn sink
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update attn_sink script
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* fix some error
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* clang-format
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update fmha_bwd mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update fmha_bwd_kernel's mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
* fix ci error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* fix format error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_bwd_pipeline_default_policy.hpp
* Update fmha_fwd_runner.hpp
* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* update splitkv_pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update splitkv&pagedkv pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* add sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update attn_sink result log
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update smoke_test_fwd_sink.sh
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update test file
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* update test script
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
* use constexpr kHasSink for sink in fmha pipeline
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
* update by pre-commit
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fmha_fwd.py
* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Remove causal mask setting logic from mask.hpp
Removed the mask setting logic for causal masks.
* fix ci error: some lambda usages are not supported in C++17
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update remod.py
* add smoke sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update fmha_pagedkv_prefill.py
* Update FmhaFwdPipeline parameters in fmha_fwd.py
* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* fix c++17 unsupported error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
* Fix formatting of sink_seq_end assignment
* Fix indentation for sink_seq_end assignment
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com >
Signed-off-by: LJ-underdog <Jun.Lin@amd.com >
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-20 19:24:05 +08:00
Illia Silin
84540edff3
fix typo ( #3244 )
2025-11-19 20:23:09 -08:00
Yi DING
47e2ed838e
[CK_TILE] Add Flatmm MX FP8 ( #3208 )
...
* Use async for flatmm mxfp4
* Fix preshuffle
* Add flatmm mxfp8
* Thanks, Copilot
* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
AviralGoelAMD
4e49e0228b
chore(copyright): update copyright header for test directory
2025-11-19 17:43:28 -07:00
linqunAMD
d2e32b4305
[ck_tile] enable test grouped_gemm_quant and gemm_streamk on gfx12 ( #3196 )
...
1. Enable grouped_gemm_quant and gemm_streamk on gfx12
- test_ck_tile_streamk_smoke is kept on gfx9, since it looks like someone is still working on it.
2. Update warp tile size in grouped_gemm_quant and gemm_streamk unit test
3. Reduce gemm tile size to pass the build on gfx12 in test_gemm_streamk_reboot_types.hpp
2025-11-20 08:40:27 +08:00