Qianfeng Zhang
|
ff807ddd1a
|
Slightly rearrange code lines in the pipeline
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
818722a3c6
|
Use two work-groups per compute-unit for scheduling the kernel
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
1a917993cf
|
Simplify the code in all host/device IsTokenPairInsideMask() functions to reduce branching
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
57050d93a6
|
Fix masking for the min_full_attn_seqlen > 0 case
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
508f4ac632
|
Update test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py to align their tests
|
2026-06-23 09:20:58 +00:00 |
|
Qianfeng Zhang
|
49a12df133
|
Completely remove the dependency on include/ck_tile/ops/fmha/ops headers
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
f62c52a499
|
Fix the use of KV LdsBuffers to avoid unexpected overwriting that causes non-deterministic results
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
9cdd64f337
|
Change the main loop from while() to do-while() to let the compiler generate more elegant code
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
3413400044
|
Use batch dim as the first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN with ASSUME_HIGHLY_VARIED_SEQLEN
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
dcef0260ff
|
Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks
|
2026-06-23 09:20:57 +00:00 |
|
root
|
07635af84c
|
Tiny fix in hstu attention IsFullTileInsideMask()
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e033a82bd7
|
Enable BATCH_AS_FIRST_GRID_DIM grid scheduling and use ASSUME_LEAST_VARIED_SEQLEN for build control
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
d39ba52e07
|
Improve the VDramTileDistribution and VLds layout for better device loading and fewer bank conflicts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
5f16f5db20
|
Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
88a0838453
|
Add assert(contextual_seqlen >= 0) in example
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
5944a63f11
|
Update IsFullTileInsideMask() for the case where kUseLocal is true
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
cbc5485589
|
Move all test and bench scripts to folder scripts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e36446aca9
|
Add two scripts
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
a8738f4455
|
Move the division by max_seqlen to the end of the Gemm1 loop in the reference hstu-attention code
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
3cc6f4abc8
|
Tune the settings for hdim-256
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
fb5aa39762
|
Add example parameter alpha to ease the testing
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e59ec37ffb
|
Convert P to fp16/bf16 before doing second gemm in reference hstu implementation
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
c83d8587c7
|
Non-critical updates in the example and block_masking code
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
08e381d632
|
Add init_qkv and dump_output example parameters for easier debugging
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
85bc8fd805
|
Add example parameters max_seqlen and max_target
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
46301a85d9
|
Update the method for calculating max_seqlen in the example
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
a079b95b77
|
Use NRepetitions2DEpilogue for outputting the o_acc tile
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
eba3242ab8
|
Use LDS to indirectly load the Q-tile to enable dwordx4 loading and avoid wasting cache lines
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
d74b41070f
|
Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
4833daf43d
|
Adjust the threshold values for fp16/bf16 in the example
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
6e38888f46
|
Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
2b94d9261c
|
Change the do-while main loop to while-do and remove the early-exit check
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
abc8335c43
|
Adjust the code before the main loop
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
946e917e2c
|
Move k_tile loading and v_tile loading earlier in the loop
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
45ac659ae0
|
Move k_tile loading in the loop earlier
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
109dcfc2f0
|
Replace s_acc and pcomp tile array by single tile object for simplification
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
40056b95a9
|
Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler to isolate code sections better
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
8c43b793c9
|
Set the block_per_cu to 3 for hdim-128
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
1bbefda240
|
Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
11718b0af4
|
Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
d01b4f27c6
|
Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
4545d2efc1
|
Hack block_gemm_areg_bsmem_creg_v2 for gemm_1
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
6e7553be77
|
Rename the hacked block_gemm_areg_bsmem_creg_v2
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
e5977717a8
|
Move the lambda for dividing by max_seqlen from kernel to pipeline
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
70237d2e5c
|
Move the division by max_seqlen out of f_silu so that it is handled outside the main loop
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
27e64a682a
|
Set example option -save_mask default to 0
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
0ee9dff5cb
|
Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
4ff88b4400
|
Add -save_mask option to the example to output int8 mask tensor
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
124539e123
|
Update the rules of hstu masking
|
2026-06-23 09:20:57 +00:00 |
|
Qianfeng Zhang
|
b2cd7757f0
|
Add test cases for better functional verification
|
2026-06-23 09:20:57 +00:00 |
|