Qianfeng Zhang
|
de47bfe752
|
Update HstuBlockMaskWithLocal::GetTileRangeAlongX, add comments and test cases for causal == false
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
4ad55eab4d
|
Update to support min_full_attn_seqlen being bigger than max_uih_len
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
0b54f1f43d
|
Add attn_scale MakeKargs() parameter support and update the example and reference code
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
6364b641b8
|
Replace the integer max_seqlen by float scale_p as kernel/pipeline parameter
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
6784c0be5a
|
Use __builtin_amdgcn_sched_barrier(0x1) to prevent the compiler from rearranging code unexpectedly
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
492c724b14
|
Fix added case in test_hstu_attention.sh
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
d3ed6ac473
|
Update GetTileRangeAlongX to account for the non-causal + local_size > 0 situation and add a test case to test_hstu_attention.sh
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
7d698c2b78
|
Add three scripts for verification of jagged causal cases
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
14c955aade
|
Fix in GetTileRangeAlongX() and IsFullTileInsideMask() of HstuBlockMaskWithLocal
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
27019a61a0
|
Adjust the code that calculates i_m0 in the kernel
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
056166bbeb
|
[Performance] Use separate workgroups to handle seqlen scope [max_uih_len - min_full_attn_seqlen, seqlen]
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
b57939ff64
|
Fix comments in test_pytorch_hstu_mask.py scripts
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
47c4a0c2ec
|
Change the seqlen_q dim padding setting for o_dram and bias_dram
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
7efc0e226a
|
Correct some comments
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
5b295efe1e
|
Re-arrange the codes section for using sched_group_barrier
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
34edc4391c
|
Fix in using sched_group_barrier()
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
1caef1fb89
|
Move store_tile() called before the current iteration
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
45a189d73d
|
Revert "Disable support of hdim64 and hdim256 for quick compiling and testing"
This reverts commit ed062f93e2.
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
09aa41ba9c
|
Disable support of hdim64 and hdim256 for quick compiling and testing
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
140af31e86
|
Fix bug in generate_instances.py and re-generate the instances
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
71b0641d75
|
Re-org the kernel parameters in HstuAttentionFwdBatchModeBaseKargs and HstuAttentionFwdJaggModeBaseKargs
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
d131327aff
|
Remove num_target from HstuBlockMask class member since it overlaps the meaning of max_uih_len
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
19542dd99e
|
Fix the calculation of number of instructions used by sched_group_barrier
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
cad7c6b2af
|
[Performance] use iglp compiler instruction to tune the codes around gemm0 for window_size > 0 situation
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
28f08b6f38
|
Include block_dropout.hpp in the hstu kernel to avoid a potential compile failure
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
ca36501613
|
Moving code-lines in hstu pipeline
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
d921b14082
|
Remove i_loop and num_loops since seqlen_k_curr and seqlen_k_end are enough
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
ac5194f783
|
Let causal == 0 cases check IsFullTileInsideMask() before calling IsTokenPairInsideMask()
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
ff807ddd1a
|
Tiny movement in the code lines of the pipeline
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
818722a3c6
|
Use two work-groups per compute-unit for scheduling the kernel
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
1a917993cf
|
Simplify the code in all host/device IsTokenPairInsideMask() functions to reduce branching
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
57050d93a6
|
Fix masking for min_full_attn_seqlen > 0 situation
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
508f4ac632
|
Update test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py to align their tests
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
49a12df133
|
Completely remove the dependency on include/ck_tile/ops/fmha/ops headers
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
f62c52a499
|
Fix use of KV LdsBuffers to avoid unexpected overwriting that causes non-deterministic results
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
9cdd64f337
|
Change the main loop from while() to do-while() to let the compiler generate more elegant code
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
3413400044
|
Use batch dim as first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN by ASSUME_HIGHLY_VARIED_SEQLEN
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
dcef0260ff
|
Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks
|
2026-06-23 09:20:57 +00:00 |
|
root
|
07635af84c
|
Tiny fix in hstu attention IsFullTileInsideMask()
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e033a82bd7
|
Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
d39ba52e07
|
Improve the VDramTileDistribution and VLds layout for better device loading and fewer bank conflicts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
5f16f5db20
|
Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
88a0838453
|
Add assert(contextual_seqlen >= 0) in example
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
5944a63f11
|
Update IsFullTileInsideMask() for the kUseLocal == true situation
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
cbc5485589
|
Move all test and bench scripts to folder scripts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e36446aca9
|
Add two scripts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
a8738f4455
|
Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
3cc6f4abc8
|
Tune the settings for hdim-256
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
fb5aa39762
|
Add example parameter alpha to ease the testing
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e59ec37ffb
|
Convert P to fp16/bf16 before doing second gemm in reference hstu implementation
|
2026-06-23 09:20:57 +00:00 |
|