Commit Graph

1271 Commits

Author SHA1 Message Date
Qianfeng Zhang
d8f0862ff8 Tiny update in GetMaxVectorSize() 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
d0fab4c34c Rename and add scripts for testing hdim96 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
cb6fef75ca Enable hdim96 instances 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
69c7921dfa Fix with regard to define stride in MakeKLdsBlockDescriptor() 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
8590f4d71c Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
346c667470 Further correction with regard to using n0_loops and k1_loops 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
8120a86ce6 Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
84949d4812 Simplify the codes in block_gemm 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
9e252c1ab7 Further clarification in using kSubQKHeaddim and kQKHeaddim 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
b5178551de Clarify the using of kSubQKHeaddim and kQKHeaddim 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
e4e22cb2d9 Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
6ac36b9459 Tiny fix in GetQKBlockGemm 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
ff54459c23 Enable the using of WarpTile-32x32x16 and add scripts to verify 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
c4fc7b28c8 Add static_assert and comments in the with_softmax pipelines 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
c98688d5ad Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
fa5b077b91 Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
311fb8a379 Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
cf2172b580 Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
45590fb8b1 Remove useless codes in the two trload pipelines 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
ec9d1fe253 Separate Traits from Problem while being used for defining the pipeline 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
7655ffd2fb Remove the k_element_func and v_element_func from the pipeline since they are not used 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
2ebe159050 Update to the two trload pipeline to load whole Q-tile once through LDS on mi350 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
0253442ef9 Simplify the codes in block_gemm_areg_bsmem_creg_v2_hack_1 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
0e1444a73c Simplify the codes in block_gemm_areg_bsmem_trload_creg 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
76b515e092 [Performance] Change the tile settings for mi350/trload no_softmax pipeline to enable to use mfma-16x16x32 for Gemm-1 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
f608f29401 Improve the softmax+trload pipeline by using kN0=64 and prefetch only two k tiles 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
7d9a95dbe0 Tiny fix in trload with_softmax/no_softmax pipeline 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
e7bfee6043 Improve both the with_softmax and no_softmax pipelines 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
3e122a551d Add kUseTrLoad = false in non-trload pipeline 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
660604635d Clarifying the using of CK_TILE_HOST and CK_TILE_HOST_DEVICE trying to save compiling time 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
3565656005 Change in updating max_uih_seqlen in the example 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
639332b5de Use supplement_array_by_last_element(num_targets, ) in example 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
19a887cffa Use supplement_array_by_last_element() in example to simplify the codes 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
9537b958ab Update to README.md 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
2ae759c5b3 Add scripts for testing the using of separate sequence lengths for k/v 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
a926db031a Support separate sequence lengths for q and kv 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
bcefc197d8 Use separate pipelines for using or not-using softmax situations 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
4b3901e989 Implementation of hstu attention pipeline using trload for v on mi350 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
5db875df2c Fix in the comments 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
563229a9e4 Update to gemm_0's CBlockDistribution encoding so that it is compatible with gemm_1's ABlockDistribution encoding 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
4e9d4f7487 Using separate tile settings for no-softmax and with-softmax hstu attention situations 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
91afdfee40 Update to benchmark scripts to consider for using softmax 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
be3b27edd3 Add support of softmax in hstu attention 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
ec4f174ac4 Add template parameter to gemm_0 MakeCBlockTile() for the need of defining PcompBlockTileType 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
da50eea674 Move scaling by attn_scale to inside the main-loop 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
3639fb8e38 Let IsTokenPairInsideMask() return bool type 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
a71c996049 Add instances to consider for adding softmax support 2026-06-23 09:27:59 +00:00
Qianfeng Zhang
0abb52004a Remove K0 from tile setting since it is not used 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
de198549ad Change to pipeline so that it is easier to add support of using softmax 2026-06-23 09:27:58 +00:00
Qianfeng Zhang
7e79736df7 Remove using IGLP method for instruction scheduling for kUseLocal true path 2026-06-23 09:27:58 +00:00