Commit Graph

1149 Commits

Author SHA1 Message Date
Qianfeng Zhang
85bc8fd805 Add example parameter max_seqlen and max_target 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
46301a85d9 Update to the method for calculating max_seqlen in the example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
a079b95b77 Use NRepetitions2DEpilogue for outputting o_acc tile 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
eba3242ab8 Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cachelines 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d74b41070f Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4833daf43d Adjust the threshold values for fp16/bf16 in the example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6e38888f46 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
2b94d9261c Change do-while main-loop to while-do and remove early exiting check 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
abc8335c43 Adjust the codes before the main-loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
946e917e2c Move k_tile loading and v_tile loading earlier in the loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
45ac659ae0 Move k_tile loading in the loop earlier 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
109dcfc2f0 Replace s_acc and pcomp tile array by single tile object for simplification 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
40056b95a9 Add __builtin_amdgcn_sched_barrier(0) for instructing the compiler for better code isolation 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
8c43b793c9 Set the block_per_cu to 3 for hdim-128 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
1bbefda240 Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
11718b0af4 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d01b4f27c6 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4545d2efc1 Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6e7553be77 Rename the hacked block_gemm_areg_bsmem_creg_v2 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e5977717a8 Move the lambda for dividing by max_seqlen from kernel to pipeline 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
70237d2e5c Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
27e64a682a Set example option -save_mask default to 0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
0ee9dff5cb Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4ff88b4400 Add -save_mask option to the example to output int8 mask tensor 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
124539e123 Update the rules of hstu masking 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
b2cd7757f0 Add test cases for better functional verification 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3d83d23a55 Fix sequence dim length for o_dram descriptor in the kernel 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
010b3f48b3 Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time". This reverts commit 2972de4c88. 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
bce38c1531 Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
87b5aa78bd Use kK1=16 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
bce88a9e73 Use type_convert rather than static_cast in f_silu 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9c3e49a1d0 Add max_seqlen as divider in siLu 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
717aae7ce7 Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
380165c3dc Override and fix GetAlignmentK() 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
34998cfd19 Use kN0=32 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
94f8d71ee2 Temporarily close the instance for hdim64 and hdim256 to save compiling time 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9df0fad750 Hack block_gemm_areg_bsmem_creg_v2 to let s_acc for gemm_0 not need be cleared first 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6bf4877a20 Adjust the v_tile and k_tile loading location 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
ba037426c5 Put two gemms call inside one n0loop unroll 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
23852ef4c0 Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausl=true but kUseLocal=false 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
ddd9227453 Replace set_tile_if() by sweep_tile_span() to reduce branching 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
187f4d3f68 Update the GridSize() and GetTileIndex() in hstu kernel 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9a08e1090e Add scripts for measuring jagged with/no causal cases 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
c0128a9156 Tiny update in IsTokenPairInsideMask() 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
ac0e593e0d Use compiler builtin directly in f_silu for float type 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
31c21c74d8 Code re-arrangement in pipeline 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
eb2564fe46 Update the seqlen_k_curr inside the first gemm loop 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
40683ee932 Rename the performance measurement scripts 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
79fdd564b8 Add support for WarpGemm-16x16x32 in QK-BlockGemm (which enables using ds_write/read_b128 for K) 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
1986d8c578 Update in K-Lds laying-out to consider for both WarpGemm-32x32x16 and WarpGemm-16x16x16 2026-06-23 09:19:46 +00:00