Qianfeng Zhang
|
dcef0260ff
|
Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks
|
2026-06-23 09:20:57 +00:00 |
|
root
|
07635af84c
|
Tiny fix in hstu attention IsFullTileInsideMask()
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e033a82bd7
|
Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
d39ba52e07
|
Improve the VDramTileDistribution and VLds layout for better device loading and fewer bank conflicts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
5f16f5db20
|
Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
88a0838453
|
Add assert(contextual_seqlen >= 0) in example
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
5944a63f11
|
Update IsFullTileInsideMask() for the case where kUseLocal is true
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
cbc5485589
|
Move all test and bench scripts to folder scripts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e36446aca9
|
Add two scripts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
a8738f4455
|
Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention code
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
3cc6f4abc8
|
Tune the settings for hdim-256
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
fb5aa39762
|
Add example parameter alpha to ease the testing
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e59ec37ffb
|
Convert P to fp16/bf16 before doing second gemm in reference hstu implementation
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
c83d8587c7
|
Non-critical updates in example and block_masking code
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
08e381d632
|
Add init_qkv and dump_output example parameters for easier debugging
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
85bc8fd805
|
Add example parameter max_seqlen and max_target
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
46301a85d9
|
Update the method for calculating max_seqlen in the example
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
a079b95b77
|
Use NRepetitions2DEpilogue for outputting o_acc tile
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
eba3242ab8
|
Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cache lines
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
d74b41070f
|
Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
4833daf43d
|
Adjust the threshold values for fp16/bf16 in the example
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
6e38888f46
|
Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
2b94d9261c
|
Change do-while main-loop to while-do and remove early exiting check
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
abc8335c43
|
Adjust the code before the main-loop
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
946e917e2c
|
Move k_tile loading and v_tile loading earlier in the loop
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
45ac659ae0
|
Move k_tile loading in the loop earlier
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
109dcfc2f0
|
Replace s_acc and pcomp tile array by single tile object for simplification
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
40056b95a9
|
Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler for better code isolation
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
8c43b793c9
|
Set the block_per_cu to 3 for hdim-128
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
1bbefda240
|
Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
11718b0af4
|
Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
d01b4f27c6
|
Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
4545d2efc1
|
Hack block_gemm_areg_bsmem_creg_v2 for gemm_1
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
6e7553be77
|
Rename the hacked block_gemm_areg_bsmem_creg_v2
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e5977717a8
|
Move the lambda for dividing by max_seqlen from kernel to pipeline
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
70237d2e5c
|
Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
27e64a682a
|
Set example option -save_mask default to 0
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
0ee9dff5cb
|
Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
4ff88b4400
|
Add -save_mask option to the example to output int8 mask tensor
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
124539e123
|
Update the rules of hstu masking
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
b2cd7757f0
|
Add test cases for better functional verification
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
3d83d23a55
|
Fix sequence dim length for o_dram descriptor in the kernel
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
010b3f48b3
|
Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time"
This reverts commit 2972de4c88.
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
bce38c1531
|
Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
87b5aa78bd
|
Use kK1=16
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
bce88a9e73
|
Use type_convert rather than static_cast in f_silu
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
9c3e49a1d0
|
Add max_seqlen as divisor in SiLU
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
717aae7ce7
|
Stop using cast_tile_pk_fp16_fp32 to improve accuracy of fp16 hstu attention
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
380165c3dc
|
Override and fix GetAlignmentK()
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
34998cfd19
|
Use kN0=32
|
2026-06-23 09:20:57 +00:00 |
|