Commit Graph

708 Commits

Author SHA1 Message Date
Qianfeng Zhang
0b54f1f43d Add attn_scale MakeKargs() parameter support and update in example, reference codes 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
6364b641b8 Replace the integer max_seqlen by float scale_p as kernel/pipeline parameter 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
6784c0be5a Use __builtin_amdgcn_sched_barrier(0x1) to prevent the compiler from unexpected codes arrangement 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
492c724b14 Fix added case in test_hstu_attention.sh 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
d3ed6ac473 Update in GetTileRangeAlongX to consider for non-causal+local_size>0 situation and add test case to test_hstu_attention.sh 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
7d698c2b78 Add three scripts for verification of jagged causal cases 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
14c955aade Fix in GetTileRangeAlongX() and IsFullTileInsideMask() of HstuBlockMaskWithLocal 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
27019a61a0 Adjust the codes related to calculate i_m0 in the kernel 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
056166bbeb [Performance] Use separate workgroups to handle seqlen scope [max_uih_len - minfull_attn_seqlen, seqlen] 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
b57939ff64 Fix comments in test_pytorch_hstu_mask.py scripts 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
47c4a0c2ec Change the seqlen_q dim padding setting for o_dram and bias_dram 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
7efc0e226a Correct some comments 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
5b295efe1e Re-arrange the codes section for using sched_group_barrier 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
34edc4391c Fix in using sched_group_barrier() 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
1caef1fb89 Move store_tile() called before the current iteration 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
45a189d73d Revert "Disable support of hdim64 and hdim256 for quick compiling and testing"
This reverts commit ed062f93e2.
2026-06-23 09:20:58 +00:00
Qianfeng Zhang
09aa41ba9c Disable support of hdim64 and hdim256 for quick compiling and testing 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
140af31e86 Fix bug in generate_instances.py and re-generate the instances 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
71b0641d75 Re-org the kernel parameters in HstuAttentionFwdBatchModeBaseKargs and HstuAttentionFwdJaggModeBaseKargs 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
d131327aff Remove num_target from HstuBlockMask class member since it overlaps the meaning of max_uih_len 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
19542dd99e Fix the calculation of number of instructions used by sched_group_barrier 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
cad7c6b2af [Performance] use iglp compiler instruction to tune the codes around gemm0 for window_size > 0 situation 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
28f08b6f38 Add including of block_dropout.hpp in the hstu kernel to avoid potential compiling failure 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
ca36501613 Moving code-lines in hstu pipeline 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
d921b14082 Remove using i_loop and num_loops since seqlen_k_curr and seqlen_k_end is enough 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
ac5194f783 Let causal == 0 cases to do IsFullTileInsideMask() checking before calling IsTokenPairInsideMask() 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
ff807ddd1a Tiny movement in the code lines of the pipeline 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
818722a3c6 Use two work-groups per compute-unit for scheduling the kernel 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
1a917993cf Simplify the codes in all host/device IsTokenPairInsideMask() trying to reduce branching 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
57050d93a6 Fix masking for min_full_attn_seqlen > 0 situation 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
508f4ac632 Update to test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py to align their testings 2026-06-23 09:20:58 +00:00
Qianfeng Zhang
49a12df133 Completely remove the dependency to include/ck_tile/ops/fmha/ops headers 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
f62c52a499 Fix in using KV LdsBuffers to avoid unexpected overwriting that causes non-deterministic results 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9cdd64f337 Change while() do to do while() for the main loop to let the compiler to generate more elegant codes 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3413400044 Use batch dim as first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN by ASSUME_HIGHLY_VARIED_SEQLEN 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
dcef0260ff Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks 2026-06-23 09:20:57 +00:00
root
07635af84c Tiny fix in hstu attention IsFullTileInsideMask() 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e033a82bd7 Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
d39ba52e07 Improve the VDramTileDistribution and VLds layout for better device loading and reduce bank-conflict 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
5f16f5db20 Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
88a0838453 Add assert(contextual_seqlen >= 0) in example 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
5944a63f11 Update IsFullTileInsideMask() for kUseLocal is true situation 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
cbc5485589 Move all test and bench scripts to folder scripts 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e36446aca9 Add two scripts 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
a8738f4455 Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
3cc6f4abc8 Tune the settings for hdim-256 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
fb5aa39762 Add example parameter alpha to ease the testing 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
e59ec37ffb Convert P to fp16/bf16 before doing second gemm in reference hstu implementation 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
c83d8587c7 not-critical updates in example and block_masking codes 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
08e381d632 Add init_qkv and dump_output example parameters for easier debugging 2026-06-23 09:20:57 +00:00