Commit Graph

3692 Commits

Author SHA1 Message Date
Qianfeng Zhang
e48bcff488 Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
c6dfe030d0 Add -fno-slp-vectorize option for building hstu kernels on gfx950 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
1f4319ce91 Use include <...> format to refer to header files from ck_tile 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
a97c7a75ce Mark low probability branch as unlikely in the softmax pipelines 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
68bbcac775 Use type_convert to convert float constant to CompDataType 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d243b275da Implement conditional softmax rescale in trload with_softmax pipeline 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
671c65e178 Implement conditional softmax rescale in non-trload with_softmax pipeline 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
cfbd8a342a Rename test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
62bf2296c6 Remove exposing kUseTrLoad as template parameter of pipeline problem 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
67f9461b42 Simplification in the cross_attention testing/benchmarking scripts 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
bd46155431 Remove max_target 3200 cases from cross_attention testing and benchmarking 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
f99ed6225b Clarify the use of max_seqlen and max_seqlen_q 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
6f2a73b17d Add scripts for testing/benchmarking cross_attention cases 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
8a7f0a8e99 Clarify the use of the group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for the group attention example 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
c0922a6cb8 Add implementation of fwd splitkv on no_softmax path 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
410f472a33 Remove dropout=true instances to reduce compile time 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
686125c0cd Rename default_policy to policy for hstu_attention forward 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
27d01448d0 Move the calling of mask.GetTileRangeAlongX() to the kernel 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
f99d6c4112 Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile() 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
ac9a142e63 Enable run-time selection of MTile sizes according to the predicted CU utilization ratio 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
0462d44215 Update to support grouped mode hstu attention 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
4781582b0f Use the in-place version of block_tile_reduce() to avoid using m_local 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d206045c53 Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane() 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
57d837977b Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
09656528a6 Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d55d6a19a7 Update to hstu masking to separate the implementation for cross-attention and self-attention 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
3b674ee8c9 Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d8f7e5a791 Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
7d110e3872 Add softmax selection to two of the testing scripts 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
1e199d0641 [Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
24d6e49323 Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
ea9e4d8e00 Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
cf971db352 Fix K-LdsBuffer and V-LdsBuffer overlap checking 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
bcedecce4f Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
500f2245f7 Tiny fix in using v_tiles[] index 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
f03750674a Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
f0c4f1bc85 Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
e448412a5a Load Q directly from global memory to registers for BlockGemm 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
90988313a9 Remove un-used including from default policy file 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
c4b8663c00 Move common code from Problem class scope to the detail namespace 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
e80c99b672 Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f) 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
97136bd3f7 Add support of loading QK tiles of hdim96 without padding to hdim128 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
13fdb382b2 Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d7ddc76542 Rename WarpTile in fwd setting 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
7d9221a605 Simplify the code defining the KDram and QDram tile distributions 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d8f0862ff8 Tiny update in GetMaxVectorSize() 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d0fab4c34c Rename and add scripts for testing hdim96 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
cb6fef75ca Enable hdim96 instances 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
69c7921dfa Fix the stride definition in MakeKLdsBlockDescriptor() 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
8590f4d71c Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV 2026-06-23 09:27:59 +00:00