Qianfeng Zhang
|
67f9461b42
|
Simplify the cross_attention testing/benchmarking scripts
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
bd46155431
|
Remove max_target 3200 cases from cross_attention testing and benchmarking
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f99ed6225b
|
Clarify the use of max_seqlen and max_seqlen_q
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
6f2a73b17d
|
Add scripts for testing/benchmarking cross_attention cases
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
8a7f0a8e99
|
Clarify the use of the group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for the group attention example
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
c0922a6cb8
|
Add implementation of fwd splitkv on no_softmax path
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
410f472a33
|
Remove dropout=true instances to reduce compile time
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
686125c0cd
|
Rename default_policy to policy for hstu_attention forward
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
27d01448d0
|
Move the call to mask.GetTileRangeAlongX() into the kernel
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f99d6c4112
|
Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ac9a142e63
|
Enable run-time selection of MTile sizes according to the predicted CU utilization ratio
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
0462d44215
|
Update to support grouped mode hstu attention
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
4781582b0f
|
Use the in-place version of block_tile_reduce() to avoid using m_local
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d206045c53
|
Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
57d837977b
|
Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
09656528a6
|
Use kIsCrossAttention as a Problem attribute instead of passing is_cross_attention as a kernel argument
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d55d6a19a7
|
Update hstu masking to separate the implementations for cross-attention and self-attention
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
3b674ee8c9
|
Add is_cross_attention as both a host API and a kernel parameter so that separate masking rules are used for self-attention and cross-attention
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d8f7e5a791
|
Change the tile setting to use mfma-32x32x16 for the WithSoftmax pipeline on gfx950
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
7d110e3872
|
Add softmax selection to two of the testing scripts
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
1e199d0641
|
[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
24d6e49323
|
Add scripts for benchmarking sparsity 0.9 cases with mattn256 & full256
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ea9e4d8e00
|
Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
cf971db352
|
Fix the K-LdsBuffer and V-LdsBuffer overlap check
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
bcedecce4f
|
Remove unneeded constexpr check for loading v_tiles in the Gemm0 loop
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
500f2245f7
|
Tiny fix in the use of the v_tiles[] index
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f03750674a
|
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f0c4f1bc85
|
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
e448412a5a
|
Load Q directly from global memory to registers for BlockGemm
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
90988313a9
|
Remove unused include from the default policy file
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
c4b8663c00
|
Move common code from the Problem class scope to the detail namespace
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
e80c99b672
|
Remove unnecessary call to __builtin_amdgcn_s_waitcnt(0xc07f)
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
97136bd3f7
|
Add support for loading QK tiles of hdim96 without padding to hdim128
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
13fdb382b2
|
Change the Q/K DramTile encoding and rename in V/VShuffled DramTile
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d7ddc76542
|
Rename WarpTile in fwd setting
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
7d9221a605
|
Simplify the code defining the KDram and QDram tile distributions
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d8f0862ff8
|
Tiny update in GetMaxVectorSize()
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d0fab4c34c
|
Rename and add scripts for testing hdim96
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
cb6fef75ca
|
Enable hdim96 instances
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
69c7921dfa
|
Fix the stride definition in MakeKLdsBlockDescriptor()
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
8590f4d71c
|
Update the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
346c667470
|
Further correction to the use of n0_loops and k1_loops
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
8120a86ce6
|
Add kN0Sub to separate the n0_loop and k1_loop tile sizes for more flexible tuning
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
84949d4812
|
Simplify the code in block_gemm
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
9e252c1ab7
|
Further clarify the use of kSubQKHeaddim and kQKHeaddim
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
b5178551de
|
Clarify the use of kSubQKHeaddim and kQKHeaddim
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
e4e22cb2d9
|
Simplify the code for k_lds_wite_windows and k_lds_read_windows in the pipelines
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
6ac36b9459
|
Tiny fix in GetQKBlockGemm
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ff54459c23
|
Enable the use of WarpTile-32x32x16 and add scripts to verify it
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
c4fc7b28c8
|
Add static_assert and comments in the with_softmax pipelines
|
2026-06-23 09:27:59 +00:00 |
|