Commit Graph

2929 Commits

Author SHA1 Message Date
Qianfeng Zhang
d099819657 Rename test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh 2026-04-24 07:36:41 +00:00
Qianfeng Zhang
0b6bbe45d6 Stop exposing kUseTrLoad as a template parameter of the pipeline problem 2026-04-21 15:35:03 +00:00
Qianfeng Zhang
8f0f7ca436 Simplify the cross_attention testing/benchmarking scripts 2026-04-17 09:38:41 +00:00
Qianfeng Zhang
3f9f2fa736 Remove max_target 3200 cases from cross_attention testing and benchmarking 2026-04-17 09:17:38 +00:00
Qianfeng Zhang
db3263469c Clarify the use of max_seqlen and max_seqlen_q 2026-04-17 09:13:45 +00:00
Qianfeng Zhang
5c84f54fd9 Add scripts for testing/benchmarking cross_attention cases 2026-04-16 15:45:57 +00:00
Qianfeng Zhang
7889844d6b Clarify the use of the group_max_seqlens[] and group_input_max_uih_seqlens[] parameters in the group attention example 2026-04-16 07:11:55 +00:00
Qianfeng Zhang
9279af33f1 Add a fwd splitkv implementation on the no_softmax path 2026-04-16 07:11:06 +00:00
Qianfeng Zhang
a95f64601d Remove dropout=true instances to reduce compile time 2026-04-07 09:38:18 +00:00
Qianfeng Zhang
348c3e05be Rename default_policy to policy for hstu_attention forward 2026-04-07 08:41:58 +00:00
Qianfeng Zhang
423cc72bc4 Move the call to mask.GetTileRangeAlongX() into the kernel 2026-03-28 14:29:17 +00:00
Qianfeng Zhang
eefe426ef7 Handle max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile() 2026-03-26 14:40:01 +00:00
Qianfeng Zhang
76da618c85 Enable run-time selection of MTile sizes according to the predicted CU utilization ratio 2026-03-20 05:36:44 +00:00
Qianfeng Zhang
302537c5a8 Update to support grouped-mode hstu attention 2026-03-09 16:15:58 +00:00
Qianfeng Zhang
73d6e0eb67 Use the in-place version of block_tile_reduce() so that m_local is no longer needed 2026-03-05 16:27:41 +00:00
Qianfeng Zhang
2be2c3cd11 Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls to __builtin_amdgcn_readfirstlane() 2026-02-21 14:59:00 +00:00
Qianfeng Zhang
f2a555dac7 Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts 2026-02-09 15:55:13 +00:00
Qianfeng Zhang
6f8b9548b5 Use kIsCrossAttention as a Problem attribute, replacing the is_cross_attention kernel argument 2026-02-09 09:02:17 +00:00
Qianfeng Zhang
bdfa0a74c2 Update hstu masking to separate the implementations for cross-attention and self-attention 2026-02-08 16:00:47 +00:00
Qianfeng Zhang
0711f4f90a Add is_cross_attention as both a host API and kernel parameter so that separate masking rules are used for self- and cross-attention 2026-02-06 15:40:07 +00:00
Qianfeng Zhang
d169ed2194 Change the tile setting to use mfma-32x32x16 for the WithSoftmax pipeline on gfx950 2026-02-05 15:57:18 +00:00
Qianfeng Zhang
8af5e26717 Add softmax selection to two of the testing scripts 2026-02-05 15:27:15 +00:00
Qianfeng Zhang
0a8c5f523a [Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling 2026-02-02 15:59:38 +00:00
Qianfeng Zhang
c360e0cbc4 Add scripts for benchmarking sparsity-0.9 cases with mattn256 & full256 2026-01-30 10:02:31 +00:00
Qianfeng Zhang
749e83f2fd Update to use BottomRight-Diagonal masking when seqlen_kv is larger than seqlen_q 2026-01-26 13:45:42 +00:00
Qianfeng Zhang
1d4d925ba3 Fix the K-LdsBuffer and V-LdsBuffer overlap check 2025-12-27 05:43:11 +00:00
Qianfeng Zhang
d2dadc22a7 Remove an unneeded constexpr check for loading v_tiles in the Gemm0 loop 2025-12-26 15:38:52 +00:00
Qianfeng Zhang
df902c6a06 Tiny fix in the v_tiles[] indexing 2025-12-25 15:37:22 +00:00
Qianfeng Zhang
2d53d67b6d Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving 2025-12-25 14:58:09 +00:00
Qianfeng Zhang
ddf0f1c8ed Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving 2025-12-25 14:30:57 +00:00
Qianfeng Zhang
02cae85af5 Load Q directly from global memory to registers for BlockGemm 2025-12-20 14:08:55 +00:00
Qianfeng Zhang
3d90b5f90e Remove an unused include from the default policy file 2025-12-19 10:13:41 +00:00
Qianfeng Zhang
9e47664092 Move common code from Problem class scope into the detail namespace 2025-12-17 10:37:21 +00:00
Qianfeng Zhang
89daa890d1 Remove a useless call to __builtin_amdgcn_s_waitcnt(0xc07f) 2025-12-17 07:47:17 +00:00
Qianfeng Zhang
1cf868026b Add support for loading hdim96 QK tiles without padding to hdim128 2025-12-16 16:39:40 +00:00
Qianfeng Zhang
588f573ee1 Change the Q/K DramTile encoding and rename the V/VShuffled DramTile 2025-12-16 15:03:57 +00:00
Qianfeng Zhang
179f0e857e Rename WarpTile in fwd setting 2025-12-14 16:40:52 +00:00
Qianfeng Zhang
125934a966 Simplify the code defining the KDram and QDram tile distributions 2025-12-14 14:23:56 +00:00
Qianfeng Zhang
1ab5e9da93 Tiny update in GetMaxVectorSize() 2025-12-14 04:43:02 +00:00
Qianfeng Zhang
f79a29ac80 Rename and add scripts for testing hdim96 2025-12-12 16:16:43 +00:00
Qianfeng Zhang
b3d54477f1 Enable hdim96 instances 2025-12-12 16:16:23 +00:00
Qianfeng Zhang
18108d0d54 Fix the stride definition in MakeKLdsBlockDescriptor() 2025-12-12 09:55:53 +00:00
Qianfeng Zhang
db39b44bab Update the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV 2025-12-11 10:47:54 +00:00
Qianfeng Zhang
8640ffe8eb Further correct the use of n0_loops and k1_loops 2025-12-08 16:03:56 +00:00
Qianfeng Zhang
641dae10e8 Add kN0Sub to separate the n0_loop and k1_loop tile sizes for more flexible tuning 2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857 Simplify the code in block_gemm 2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519 Further clarify the use of kSubQKHeaddim and kQKHeaddim 2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee Clarify the use of kSubQKHeaddim and kQKHeaddim 2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a Simplify the code for k_lds_write_windows and k_lds_read_windows in the pipelines 2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be Tiny fix in GetQKBlockGemm 2025-11-30 14:04:48 +00:00