Commit Graph

1227 Commits

Author SHA1 Message Date
Qianfeng Zhang
da50eea674 Move scaling by attn_scale to inside the main-loop 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
3639fb8e38 Let IsTokenPairInsideMask() return bool type 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
a71c996049 Add instances to consider for adding softmax support 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
0abb52004a Remove K0 from tile setting since it is not used 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
de198549ad Change the pipeline so that it is easier to add softmax support 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
7e79736df7 Remove using IGLP method for instruction scheduling for kUseLocal true path 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
08f50c2c51 Fix in GetQKBlockGemm() 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
5199dbd027 Simplify the warp_gemm definitions in GetQKBlockGemm and GetKVBlockGemm 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
b0710e8871 Remove useless constant statement in the kernel 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
2c861d541f Remove unnecessary HSTU_CHECK() calls 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
bc5616f1dc Add HSTU_CHECK() and use it in example codes 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
f7f90c539e Small update in reference hstu attention 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
9c4d76d96b Detach HstuBlockMask from pipeline definition and construct the HstuBlockMask type in the kernel according to window_size 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
8313b34543 Unify the license statements on all the source files 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
2668bb3aee Remove using MakeKargsImpl() to simplify the hstu kernel 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
a8c62920bf Clarify the use of kSubQKHeaddim and kQKHeaddim so that less regular hdim values (e.g. 96, 160) can be supported efficiently 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
7a948eee1d Use set_slice_tile() to replace direct thread_buffer assignment 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
260ef2fdf2 Use xor transform to implement Q/K Lds descriptor for kKpack == 8 cases 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
56dab1298f Remove selectable VLayout to simplify the code, since hdim is always the fastest dimension 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
e1ebd780c1 Use separate settings for gfx942 and gfx950 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
40091814c6 Tiny change in pipeline BlockGemm definition to adapt to the latest merging with develop branch 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
ee06d0b4fc Add norm_dist parameter for hstu example to select either normal or uniform distribution to initialize data 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
da5db1773d Tiny fix in HstuBlockMaskWithLocal::GetTileRangeAlongX() 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
446b62ad82 Adjust the atol and rtol and fix the check_err() usage in example_hstu_attention.cpp 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
dde0729405 Add simple handling for max_atten_seqlen bigger than max_uih_len situations 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
c649b9b049 Tiny fix and comments in HstuBlockMaskWithLocal::IsFullTileInsideMask() 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
de47bfe752 Update HstuBlockMaskWithLocal::GetTileRangeAlongX, add comments and test cases for causal == false 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
4ad55eab4d Update to support min_full_attn_seqlen be bigger than max_uih_len 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
0b54f1f43d Add attn_scale MakeKargs() parameter support and update in example, reference codes 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
6364b641b8 Replace the integer max_seqlen by float scale_p as kernel/pipeline parameter 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
6784c0be5a Use __builtin_amdgcn_sched_barrier(0x1) to prevent the compiler from unexpectedly rearranging the code 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
492c724b14 Fix added case in test_hstu_attention.sh 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
d3ed6ac473 Update in GetTileRangeAlongX to consider for non-causal+local_size>0 situation and add test case to test_hstu_attention.sh 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
7d698c2b78 Add three scripts for verification of jagged causal cases 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
14c955aade Fix in GetTileRangeAlongX() and IsFullTileInsideMask() of HstuBlockMaskWithLocal 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
27019a61a0 Adjust the code that calculates i_m0 in the kernel 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
056166bbeb [Performance] Use separate workgroups to handle seqlen scope [max_uih_len - minfull_attn_seqlen, seqlen] 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
b57939ff64 Fix comments in the test_pytorch_hstu_mask.py script 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
47c4a0c2ec Change the seqlen_q dim padding setting for o_dram and bias_dram 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
7efc0e226a Correct some comments 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
5b295efe1e Rearrange the code section that uses sched_group_barrier 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
34edc4391c Fix the use of sched_group_barrier() 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
1caef1fb89 Move the store_tile() call to before the current iteration 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
45a189d73d Revert "Disable support of hdim64 and hdim256 for quick compiling and testing"
This reverts commit ed062f93e2.
2026-06-23 09:20:58 +00:00
Qianfeng Zhang
09aa41ba9c Disable support of hdim64 and hdim256 for quick compiling and testing 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
140af31e86 Fix bug in generate_instances.py and re-generate the instances 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
71b0641d75 Re-org the kernel parameters in HstuAttentionFwdBatchModeBaseKargs and HstuAttentionFwdJaggModeBaseKargs 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
d131327aff Remove num_target from HstuBlockMask class member since it overlaps the meaning of max_uih_len 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
19542dd99e Fix the calculation of the number of instructions used by sched_group_barrier 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
cad7c6b2af [Performance] Use IGLP compiler instructions to tune the code around gemm0 for the window_size > 0 situation 2026-06-23 09:20:58 +00:00