Commit Graph

1173 Commits

Author SHA1 Message Date
Qianfeng Zhang
ff807ddd1a Tiny movement in the code lines of the pipeline 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
818722a3c6 Use two work-groups per compute-unit for scheduling the kernel 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
1a917993cf Simplify the codes in all host/device IsTokenPairInsideMask() trying to reduce branching 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
57050d93a6 Fix masking for min_full_attn_seqlen > 0 situation 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
508f4ac632 Update to test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py to align their testings 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
49a12df133 Completely remove the dependency to include/ck_tile/ops/fmha/ops headers 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
f62c52a499 Fix the use of KV LdsBuffers to avoid unexpected overwriting that causes non-deterministic results 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9cdd64f337 Change while() do to do while() for the main loop to let the compiler generate more elegant code 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3413400044 Use batch dim as first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN by ASSUME_HIGHLY_VARIED_SEQLEN 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
dcef0260ff Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks 2026-06-23 09:20:57 +00:00
root
07635af84c Tiny fix in hstu attention IsFullTileInsideMask() 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e033a82bd7 Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d39ba52e07 Improve the VDramTileDistribution and VLds layout for better device loading and reduce bank-conflict 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
5f16f5db20 Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
88a0838453 Add assert(contextual_seqlen >= 0) in example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
5944a63f11 Update IsFullTileInsideMask() for the situation where kUseLocal is true 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
cbc5485589 Move all test and bench scripts to folder scripts 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e36446aca9 Add two scripts 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
a8738f4455 Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3cc6f4abc8 Tune the settings for hdim-256 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
fb5aa39762 Add example parameter alpha to ease the testing 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e59ec37ffb Convert P to fp16/bf16 before doing second gemm in reference hstu implementation 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
c83d8587c7 Non-critical updates in example and block_masking code 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
08e381d632 Add init_qkv and dump_output example parameters for easier debugging 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
85bc8fd805 Add example parameter max_seqlen and max_target 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
46301a85d9 Update to the method for calculating max_seqlen in the example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
a079b95b77 Use NRepetitions2DEpilogue for outputting o_acc tile 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
eba3242ab8 Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cache lines 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d74b41070f Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4833daf43d Adjust the threshold values for fp16/bf16 in the example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6e38888f46 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
2b94d9261c Change do-while main-loop to while-do and remove early exiting check 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
abc8335c43 Adjust the codes before the main-loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
946e917e2c Move k_tile loading and v_tile loading earlier in the loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
45ac659ae0 Move k_tile loading in the loop earlier 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
109dcfc2f0 Replace s_acc and pcomp tile array by single tile object for simplification 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
40056b95a9 Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler to isolate code better 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
8c43b793c9 Set the block_per_cu to 3 for hdim-128 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
1bbefda240 Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
11718b0af4 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d01b4f27c6 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4545d2efc1 Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6e7553be77 Rename the hacked block_gemm_areg_bsmem_creg_v2 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e5977717a8 Move the lambda for dividing by max_seqlen from kernel to pipeline 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
70237d2e5c Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
27e64a682a Set example option -save_mask default to 0 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
0ee9dff5cb Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
4ff88b4400 Add -save_mask option to the example to output int8 mask tensor 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
124539e123 Update the rules of hstu masking 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
b2cd7757f0 Add test cases for better functional verification 2026-06-23 09:20:57 +00:00