1104 Commits

Author SHA1 Message Date
Qianfeng Zhang
5ee8a37cd3 Move num_splits/o_acc_ptr/l_acc_ptr out from HstuAttention<xxx>FwdParams struct 2026-06-03 09:55:04 +00:00
Qianfeng Zhang
36dd77fb16 Add kStoreLSE template parameter to the problems 2026-06-03 09:55:04 +00:00
Qianfeng Zhang
333abddbae Rename the reference interfaces and the files 2026-05-28 08:07:54 +00:00
Qianfeng Zhang
e841981ddd Update to MakeLSEaccDramTileDistribution trying to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() work on less KThreadPerWarp 2026-05-23 07:24:00 +00:00
Qianfeng Zhang
1dbd127d1b Use buffer_view to create lse_acc_dram_naive so that out_of_boundary loading value can be specified (be -inf) 2026-05-23 04:37:52 +00:00
Qianfeng Zhang
86d8d72008 Use partition_index parameter for all get_x_indices_from_distributed_indices() calls 2026-05-22 15:08:30 +00:00
Qianfeng Zhang
65992be728 Update to the cross_attention test/bench scripts 2026-05-22 14:44:14 +00:00
Qianfeng Zhang
b1052e87e1 Add implementation of hstu fwd splitkv for softmax path 2026-05-22 09:54:27 +00:00
Qianfeng Zhang
8a7529177d Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack 2026-05-11 08:53:07 +00:00
Qianfeng Zhang
0a32eddc0a Re-format the .hpp/.cpp files using clang-format-18 2026-05-10 13:46:02 +00:00
Qianfeng Zhang
6981f148ee Fix potential bug in kernel host interface BlockSize() 2026-05-08 06:27:19 -04:00
Qianfeng Zhang
250f325c3a More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy 2026-05-07 06:32:01 -04:00
Qianfeng Zhang
888b6cad86 Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950 2026-04-30 14:06:06 +00:00
Qianfeng Zhang
4c583f0574 Add -fno-slp-vectorize option for building hstu kernels on gfx950 2026-04-30 13:37:22 +00:00
Qianfeng Zhang
7883f52d9f Use include <...> format to refer to header files from ck_tile 2026-04-27 10:15:14 +00:00
Qianfeng Zhang
d0803f263d Mark low probability branch as unlikely in the softmax pipelines 2026-04-27 07:24:03 +00:00
Qianfeng Zhang
b9d4be0982 Use type_convert to convert float constant to CompDataType 2026-04-24 15:46:26 +00:00
Qianfeng Zhang
1f2e2a272e Implement conditional softmax rescale in trload with_softmax pipeline 2026-04-24 09:54:12 +00:00
Qianfeng Zhang
90e718f73d Implement conditional softmax rescale in non-trload with_softmax pipeline 2026-04-24 09:22:44 +00:00
Qianfeng Zhang
d099819657 Renaming the test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh 2026-04-24 07:36:41 +00:00
Qianfeng Zhang
0b6bbe45d6 Remove exposing kUseTrLoad as template parameter of pipeline problem 2026-04-21 15:35:03 +00:00
Qianfeng Zhang
8f0f7ca436 Simplification in the cross_attention testing/benchmarking scripts 2026-04-17 09:38:41 +00:00
Qianfeng Zhang
3f9f2fa736 Remove max_target 3200 cases from cross_attention testing and benchmarking 2026-04-17 09:17:38 +00:00
Qianfeng Zhang
db3263469c Clarify the using the max_seqlen and max_seqlen_q 2026-04-17 09:13:45 +00:00
Qianfeng Zhang
5c84f54fd9 Add scripts for testing/benchmarking cross_attention cases 2026-04-16 15:45:57 +00:00
Qianfeng Zhang
7889844d6b Clarify the using of group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for group attention example 2026-04-16 07:11:55 +00:00
Qianfeng Zhang
9279af33f1 Add implementation of fwd splitkv on no_softmax path 2026-04-16 07:11:06 +00:00
Qianfeng Zhang
a95f64601d Remove dropout=true instances to reduce compiling-time 2026-04-07 09:38:18 +00:00
Qianfeng Zhang
348c3e05be Rename default_policy to policy for hstu_attention forward 2026-04-07 08:41:58 +00:00
Qianfeng Zhang
423cc72bc4 Move the calling of mask.GetTileRangeAlongX() to the kernel 2026-03-28 14:29:17 +00:00
Qianfeng Zhang
eefe426ef7 Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile() 2026-03-26 14:40:01 +00:00
Qianfeng Zhang
76da618c85 Enable run-time selection of MTile sizes according to the predicted CU utilization ratio 2026-03-20 05:36:44 +00:00
Qianfeng Zhang
302537c5a8 Update to support grouped mode hstu attention 2026-03-09 16:15:58 +00:00
Qianfeng Zhang
73d6e0eb67 Using in-place version of block_tile_reduce() so that using of m_local is avoided 2026-03-05 16:27:41 +00:00
Qianfeng Zhang
2be2c3cd11 Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane() 2026-02-21 14:59:00 +00:00
Qianfeng Zhang
f2a555dac7 Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts 2026-02-09 15:55:13 +00:00
Qianfeng Zhang
6f8b9548b5 Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument 2026-02-09 09:02:17 +00:00
Qianfeng Zhang
bdfa0a74c2 Update to hstu masking to separate the implementation for cross-attention and self-attention 2026-02-08 16:00:47 +00:00
Qianfeng Zhang
0711f4f90a Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention 2026-02-06 15:40:07 +00:00
Qianfeng Zhang
d169ed2194 Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950 2026-02-05 15:57:18 +00:00
Qianfeng Zhang
8af5e26717 Add softmax selection to two of the testing scripts 2026-02-05 15:27:15 +00:00
Qianfeng Zhang
0a8c5f523a [Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling 2026-02-02 15:59:38 +00:00
Qianfeng Zhang
c360e0cbc4 Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256 2026-01-30 10:02:31 +00:00
Qianfeng Zhang
749e83f2fd Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q 2026-01-26 13:45:42 +00:00
Qianfeng Zhang
1d4d925ba3 Fix in K-LdsBuffer and V-LdsBuffer over-lap checking 2025-12-27 05:43:11 +00:00
Qianfeng Zhang
d2dadc22a7 Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop 2025-12-26 15:38:52 +00:00
Qianfeng Zhang
df902c6a06 Tiny fix in using v_tiles[] index 2025-12-25 15:37:22 +00:00
Qianfeng Zhang
2d53d67b6d Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving 2025-12-25 14:58:09 +00:00
Qianfeng Zhang
ddf0f1c8ed Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving 2025-12-25 14:30:57 +00:00
Qianfeng Zhang
02cae85af5 Load Q directly from global memory to registers for BlockGemm 2025-12-20 14:08:55 +00:00