Qianfeng Zhang
76da618c85
Enable run-time selection of MTile sizes according to the predicted CU utilization ratio
2026-03-20 05:36:44 +00:00
Qianfeng Zhang
302537c5a8
Update to support grouped mode hstu attention
2026-03-09 16:15:58 +00:00
Qianfeng Zhang
73d6e0eb67
Use the in-place version of block_tile_reduce() so that m_local is no longer needed
2026-03-05 16:27:41 +00:00
Qianfeng Zhang
2be2c3cd11
Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls to __builtin_amdgcn_readfirstlane()
2026-02-21 14:59:00 +00:00
Qianfeng Zhang
f2a555dac7
Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts
2026-02-09 15:55:13 +00:00
Qianfeng Zhang
6f8b9548b5
Use kIsCrossAttention as a Problem attribute instead of passing is_cross_attention as a kernel argument
2026-02-09 09:02:17 +00:00
Qianfeng Zhang
bdfa0a74c2
Update hstu masking to separate the implementations for cross-attention and self-attention
2026-02-08 16:00:47 +00:00
Qianfeng Zhang
0711f4f90a
Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention
2026-02-06 15:40:07 +00:00
Qianfeng Zhang
d169ed2194
Change tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950
2026-02-05 15:57:18 +00:00
Qianfeng Zhang
8af5e26717
Add softmax selection to two of the testing scripts
2026-02-05 15:27:15 +00:00
Qianfeng Zhang
0a8c5f523a
[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling
2026-02-02 15:59:38 +00:00
Qianfeng Zhang
c360e0cbc4
Add scripts for benchmarking sparsity 0.9 cases with mattn256 & full256
2026-01-30 10:02:31 +00:00
Qianfeng Zhang
749e83f2fd
Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q
2026-01-26 13:45:42 +00:00
Qianfeng Zhang
1d4d925ba3
Fix K-LdsBuffer and V-LdsBuffer overlap checking
2025-12-27 05:43:11 +00:00
Qianfeng Zhang
d2dadc22a7
Remove unneeded constexpr check for loading v_tiles in Gemm0 loop
2025-12-26 15:38:52 +00:00
Qianfeng Zhang
df902c6a06
Tiny fix in using v_tiles[] index
2025-12-25 15:37:22 +00:00
Qianfeng Zhang
2d53d67b6d
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving
2025-12-25 14:58:09 +00:00
Qianfeng Zhang
ddf0f1c8ed
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving
2025-12-25 14:30:57 +00:00
Qianfeng Zhang
02cae85af5
Load Q directly from global memory to registers for BlockGemm
2025-12-20 14:08:55 +00:00
Qianfeng Zhang
3d90b5f90e
Remove unused include from default policy file
2025-12-19 10:13:41 +00:00
Qianfeng Zhang
9e47664092
Move common code from Problem class scope to detail namespace
2025-12-17 10:37:21 +00:00
Qianfeng Zhang
89daa890d1
Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f)
2025-12-17 07:47:17 +00:00
Qianfeng Zhang
1cf868026b
Add support for loading QK tiles of hdim96 without padding to hdim128
2025-12-16 16:39:40 +00:00
Qianfeng Zhang
588f573ee1
Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile
2025-12-16 15:03:57 +00:00
Qianfeng Zhang
179f0e857e
Rename WarpTile in fwd setting
2025-12-14 16:40:52 +00:00
Qianfeng Zhang
125934a966
Simplify the code defining the KDram and QDram tile distributions
2025-12-14 14:23:56 +00:00
Qianfeng Zhang
1ab5e9da93
Tiny update in GetMaxVectorSize()
2025-12-14 04:43:02 +00:00
Qianfeng Zhang
f79a29ac80
Rename and add scripts for testing hdim96
2025-12-12 16:16:43 +00:00
Qianfeng Zhang
b3d54477f1
Enable hdim96 instances
2025-12-12 16:16:23 +00:00
Qianfeng Zhang
18108d0d54
Fix the stride definition in MakeKLdsBlockDescriptor()
2025-12-12 09:55:53 +00:00
Qianfeng Zhang
db39b44bab
Update the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV
2025-12-11 10:47:54 +00:00
Qianfeng Zhang
8640ffe8eb
Further correct the use of n0_loops and k1_loops
2025-12-08 16:03:56 +00:00
Qianfeng Zhang
641dae10e8
Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning
2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857
Simplify the code in block_gemm
2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519
Further clarify the use of kSubQKHeaddim and kQKHeaddim
2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee
Clarify the use of kSubQKHeaddim and kQKHeaddim
2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a
Simplify the code for k_lds_wite_windows and k_lds_read_windows in the pipelines
2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be
Tiny fix in GetQKBlockGemm
2025-11-30 14:04:48 +00:00
Qianfeng Zhang
f01e0ef37d
Enable the use of WarpTile-32x32x16 and add scripts to verify
2025-11-30 04:58:28 +00:00
Qianfeng Zhang
d99493606e
Add static_assert and comments in the with_softmax pipelines
2025-11-28 15:19:33 +00:00
Qianfeng Zhang
f952d3571c
Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950
2025-11-28 14:02:16 +00:00
Qianfeng Zhang
a0e4315d4e
Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline
2025-11-27 15:30:53 +00:00
Qianfeng Zhang
69c97c06d7
Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md
2025-11-27 08:27:52 +00:00
Qianfeng Zhang
f9e8c5539f
Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window
2025-11-23 04:49:01 +00:00
Qianfeng Zhang
4f33eb5857
Merge branch 'develop' into hstu_attention_mi350_fwd_bwd
2025-11-23 04:20:53 +00:00
Emily Martins
2e4b8a8fc4
[CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Removes the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed
Revert "Add attn sink (#2892)" (#3250)
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab
Add attn sink (#2892)
* enable attn sink
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update attn_sink script
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* fix some errors
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* clang-format
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update fmha_bwd mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update fmha_bwd_kernel's mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* fix ci error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* fix format error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_bwd_pipeline_default_policy.hpp
* Update fmha_fwd_runner.hpp
* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* update splitkv_pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update splitkv&pagedkv pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* add sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update attn_sink result log
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update smoke_test_fwd_sink.sh
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update test file
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update test script
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
* use constexpr kHasSink for sink in fmha pipeline
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* update by pre-commit
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update fmha_fwd.py
* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Remove causal mask setting logic from mask.hpp
Removed the mask setting logic for causal masks.
* fix ci error: some lambda usage is not supported in c++17
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update remod.py
* add smoke sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update fmha_pagedkv_prefill.py
* Update FmhaFwdPipeline parameters in fmha_fwd.py
* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* fix c++17 unsupported error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
* Fix formatting of sink_seq_end assignment
* Fix indentation for sink_seq_end assignment
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-20 19:24:05 +08:00
Yi DING
47e2ed838e
[CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4
* Fix preshuffle
* Add flatmm mxfp8
* Thanks, Copilot
* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal
1eb26460aa
[ck_tile] Pooling example - Improved tile sizes (#3233)
* improved tile sizes
- modified tile sizes for improved example performance
* Update example/ck_tile/36_pooling/pool3d.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-11-19 15:30:18 +01:00