Commit Graph

1316 Commits

Author SHA1 Message Date
Qianfeng Zhang
9cdd64f337 Change while() do to do while() for the main loop to let the compiler generate more elegant code 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3413400044 Use batch dim as first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN by ASSUME_HIGHLY_VARIED_SEQLEN 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
dcef0260ff Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks 2026-06-23 09:20:57 +00:00
root
07635af84c Tiny fix in hstu attention IsFullTileInsideMask() 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e033a82bd7 Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d39ba52e07 Improve the VDramTileDistribution and VLds layout for better device loading and fewer bank conflicts 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
5f16f5db20 Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
88a0838453 Add assert(contextual_seqlen >= 0) in example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
5944a63f11 Update IsFulleTileInsideMask() for the situation where kUseLocal is true 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
cbc5485589 Move all test and bench scripts to folder scripts 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e36446aca9 Add two scripts 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
a8738f4455 Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3cc6f4abc8 Tune the settings for hdim-256 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
fb5aa39762 Add example parameter alpha to ease the testing 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e59ec37ffb Convert P to fp16/bf16 before doing second gemm in reference hstu implementation 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
c83d8587c7 Non-critical updates in example and block_masking code 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
08e381d632 Add init_qkv and dump_output example parameters for easier debugging 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
85bc8fd805 Add example parameter max_seqlen and max_target 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
46301a85d9 Update the method for calculating max_seqlen in the example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
a079b95b77 Use NRepetitions2DEpilogue for outputting o_acc tile 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
eba3242ab8 Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cache lines 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d74b41070f Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4833daf43d Adjust the threshold values for fp16/bf16 in the example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6e38888f46 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
2b94d9261c Change do-while main-loop to while-do and remove early exiting check 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
abc8335c43 Adjust the codes before the main-loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
946e917e2c Move k_tile loading and v_tile loading earlier in the loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
45ac659ae0 Move k_tile loading in the loop earlier 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
109dcfc2f0 Replace s_acc and pcomp tile array by single tile object for simplification 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
40056b95a9 Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler to isolate code better 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
8c43b793c9 Set the block_per_cu to 3 for hdim-128 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
1bbefda240 Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
11718b0af4 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d01b4f27c6 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4545d2efc1 Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6e7553be77 Rename the hacked block_gemm_areg_bsmem_creg_v2 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e5977717a8 Move the lambda for dividing by max_seqlen from kernel to pipeline 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
70237d2e5c Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
27e64a682a Set example option -save_mask default to 0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
0ee9dff5cb Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4ff88b4400 Add -save_mask option to the example to output int8 mask tensor 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
124539e123 Update the rules of hstu masking 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
b2cd7757f0 Add test cases for better functional verification 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3d83d23a55 Fix sequence dim length for o_dram descriptor in the kernel 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
010b3f48b3 Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time"
This reverts commit 2972de4c88.
2026-06-23 09:20:57 +00:00
Qianfeng Zhang
bce38c1531 Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
87b5aa78bd Use kK1=16 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
bce88a9e73 Use type_convert rather than static_cast in f_silu 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9c3e49a1d0 Add max_seqlen as divisor in SiLU 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
717aae7ce7 Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention 2026-06-23 09:20:57 +00:00