Commit Graph

1122 Commits

Author SHA1 Message Date
Qianfeng Zhang
010b3f48b3 Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time"
This reverts commit 2972de4c88.
2026-06-23 09:20:57 +00:00
Qianfeng Zhang
bce38c1531 Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
87b5aa78bd Use kK1=16 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
bce88a9e73 Use type_convert rather than static_cast in f_silu 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9c3e49a1d0 Add max_seqlen as divider in siLu 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
717aae7ce7 Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
380165c3dc Override and fix GetAlignmentK() 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
34998cfd19 Use kN0=32 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
94f8d71ee2 Temporarily close the instance for hdim64 and hdim256 to save compiling time 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9df0fad750 Hack block_gemm_areg_bsmem_creg_v2 to let s_acc for gemm_0 not need be cleared first 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
6bf4877a20 Adjust the v_tile and k_tile loading location 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
ba037426c5 Put two gemms call inside one n0loop unroll 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
23852ef4c0 Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausl=true but kUseLocal=false 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
ddd9227453 Replace set_tile_if() by sweep_tile_span() to reduce branching 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
187f4d3f68 Update the GridSize() and GetTileIndex() in hstu kernel 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
9a08e1090e Add scripts for measuring jagged with/no causal cases 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
c0128a9156 Tiny update in IsTokenPairInsideMask() 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
ac0e593e0d Use compiler builtin directly in f_silu for float type 2026-06-23 09:20:57 +00:00
Qianfeng Zhang
31c21c74d8 Code re-arrangement in pipeline 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
eb2564fe46 Update the seqlen_k_curr inside the first gemm loop 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
40683ee932 Rename the performance measurement scripts 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
79fdd564b8 Add support for WarpGemm-16x16x32 in QK-BlockGemm (which enables using ds_write/read_b128 for K) 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
1986d8c578 Update in K-Lds laying-out to consider for both WarpGemm-32x32x16 and WarpGemm-16x16x16 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
6dd83b2a5a Use 16x16x16 WarpGemm 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
7153a99dd4 Using __builtin_amdgcn_rcpf in siLU function 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
fb89a013b7 Combine minus with scale_s 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
23e80a5964 Move silu calculation to gemm1 iteration and try to interleave gemm_1 and silu 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
f40d68c1a9 Update in using masking for the case where kMasking is false and kPadSeqLenK is true 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
95b9a277ac Fix in generate_instances.py and re-generated the instances 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
e5fb03a4aa Back to use exp() instead of exp2() since exp() in ck_tile using fast __builtin_amdgcn_exp2f() 2026-06-23 09:19:46 +00:00
Qianfeng Zhang
266e7bc8e9 Use kN0=64 to save vgprs 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8f7a97fe02 Fix the script name 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
bace12feac Fix in GetTileRangeAlongX 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
7a7c17802a Add script compare_with_triton_2.sh for measuring the jagged cases of seqlen 1024/2048/4096/8192/16384/32768 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
e731437af1 Change gemm0 to iterate along kN0 so that BlockGemm can overlap with masking and siLu 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8da21d9cde Fix the GetTileRangeAlongX() to align with the hstu masking definition when both causal=true and local=true 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
065776d42d Remove un-needed __builtin_amdgcn_sched_barrier(0) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
63acd4638b Use shared ring Lds buffers for K/V to avoid over-lapping between first-K/last-V or last-K/first-V 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
58090fe730 Tiny codes simplification in pipeline 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
ec14e9df3e Remove one line of __builtin_amdgcn_sched_barrier(0) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
c0609d49cd Fix the integer overflow in total_flops calculation 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
1efb2a8f38 Add scripts for comparing with triton 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
70b4aa310f Use exp2() to calculate exp() for better performance 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
45019fd5fd Remove the comparing of row/col to max_uih_len in masking 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
ad10a2dd53 Use kM0=128 kN0=64 to completely remove the vgprs spilling 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8b2948b31e Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
fafb375122 Use packed cast_tile for fp16 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
6686c7af44 Update to partially reduce the register spilling 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
459c5565d4 Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce call of s_barrier() 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8a6c2591b0 Update the pipeline code 2026-06-23 09:17:26 +00:00