Commit Graph

1113 Commits

Author SHA1 Message Date
Qianfeng Zhang
798fd3cd8b Enable the kernel dispatching path from is_training & use_softmax to kStoreLSE 2026-06-05 15:51:39 +00:00
Qianfeng Zhang
8b62d651a4 Add instances and kStoreLSE template in dispatcher class to support outputting lse for fwd training 2026-06-05 15:51:39 +00:00
Qianfeng Zhang
c1e3b9be6a Set lse tensor dim strides in example 2026-06-05 15:51:39 +00:00
Qianfeng Zhang
0fdca0e940 Replace template kUseSoftmax/kStoreLSE by boolean parameters in reference fwd codes to save compiling time 2026-06-05 15:51:39 +00:00
Qianfeng Zhang
8f091efc4c Add support for outputting lse in the example and reference hstu attention forward implementation 2026-06-05 15:51:39 +00:00
Qianfeng Zhang
cc184fc202 Add support for preparing lse_dram_window in hstu fwd kernel 2026-06-05 15:51:39 +00:00
Qianfeng Zhang
75a3b5aab0 Kernel use types declared in the problem rather than the pipeline 2026-06-05 15:51:39 +00:00
Qianfeng Zhang
696d4534b2 Tiny simplification with defining the Bias related Kargs 2026-06-03 09:55:04 +00:00
Qianfeng Zhang
eba3c2f635 Add parameters used by storing lse in the fwd and fwd_splitkv_combine kernel to prepare for supporting training 2026-06-03 09:55:04 +00:00
Qianfeng Zhang
5ee8a37cd3 Move num_splits/o_acc_ptr/l_acc_ptr out from HstuAttention<xxx>FwdParams struct 2026-06-03 09:55:04 +00:00
Qianfeng Zhang
36dd77fb16 Add kStoreLSE template parameter to the problems 2026-06-03 09:55:04 +00:00
Qianfeng Zhang
333abddbae Rename the reference interfaces and the files 2026-05-28 08:07:54 +00:00
Qianfeng Zhang
e841981ddd Update MakeLSEaccDramTileDistribution to try assigning more threads to MThreadPerWarp so that block_tile_reduce_sync() works on fewer KThreadPerWarp 2026-05-23 07:24:00 +00:00
Qianfeng Zhang
1dbd127d1b Use buffer_view to create lse_acc_dram_naive so that out_of_boundary loading value can be specified (be -inf) 2026-05-23 04:37:52 +00:00
Qianfeng Zhang
86d8d72008 Use partition_index parameter for all get_x_indices_from_distributed_indices() calls 2026-05-22 15:08:30 +00:00
Qianfeng Zhang
65992be728 Update to the cross_attention test/bench scripts 2026-05-22 14:44:14 +00:00
Qianfeng Zhang
b1052e87e1 Add implementation of hstu fwd splitkv for softmax path 2026-05-22 09:54:27 +00:00
Qianfeng Zhang
8a7529177d Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack 2026-05-11 08:53:07 +00:00
Qianfeng Zhang
0a32eddc0a Re-format the .hpp/.cpp files using clang-format-18 2026-05-10 13:46:02 +00:00
Qianfeng Zhang
6981f148ee Fix potential bug in kernel host interface BlockSize() 2026-05-08 06:27:19 -04:00
Qianfeng Zhang
250f325c3a More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy 2026-05-07 06:32:01 -04:00
Qianfeng Zhang
888b6cad86 Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950 2026-04-30 14:06:06 +00:00
Qianfeng Zhang
4c583f0574 Add -fno-slp-vectorize option for building hstu kernels on gfx950 2026-04-30 13:37:22 +00:00
Qianfeng Zhang
7883f52d9f Use include <...> format to refer to header files from ck_tile 2026-04-27 10:15:14 +00:00
Qianfeng Zhang
d0803f263d Mark low probability branch as unlikely in the softmax pipelines 2026-04-27 07:24:03 +00:00
Qianfeng Zhang
b9d4be0982 Use type_convert to convert float constant to CompDataType 2026-04-24 15:46:26 +00:00
Qianfeng Zhang
1f2e2a272e Implement conditional softmax rescale in trload with_softmax pipeline 2026-04-24 09:54:12 +00:00
Qianfeng Zhang
90e718f73d Implement conditional softmax rescale in non-trload with_softmax pipeline 2026-04-24 09:22:44 +00:00
Qianfeng Zhang
d099819657 Rename test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh 2026-04-24 07:36:41 +00:00
Qianfeng Zhang
0b6bbe45d6 Remove exposing kUseTrLoad as template parameter of pipeline problem 2026-04-21 15:35:03 +00:00
Qianfeng Zhang
8f0f7ca436 Simplification in the cross_attention testing/benchmarking scripts 2026-04-17 09:38:41 +00:00
Qianfeng Zhang
3f9f2fa736 Remove max_target 3200 cases from cross_attention testing and benchmarking 2026-04-17 09:17:38 +00:00
Qianfeng Zhang
db3263469c Clarify the use of max_seqlen and max_seqlen_q 2026-04-17 09:13:45 +00:00
Qianfeng Zhang
5c84f54fd9 Add scripts for testing/benchmarking cross_attention cases 2026-04-16 15:45:57 +00:00
Qianfeng Zhang
7889844d6b Clarify the use of the group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for the group attention example 2026-04-16 07:11:55 +00:00
Qianfeng Zhang
9279af33f1 Add implementation of fwd splitkv on no_softmax path 2026-04-16 07:11:06 +00:00
Qianfeng Zhang
a95f64601d Remove dropout=true instances to reduce compiling-time 2026-04-07 09:38:18 +00:00
Qianfeng Zhang
348c3e05be Rename default_policy to policy for hstu_attention forward 2026-04-07 08:41:58 +00:00
Qianfeng Zhang
423cc72bc4 Move the calling of mask.GetTileRangeAlongX() to the kernel 2026-03-28 14:29:17 +00:00
Qianfeng Zhang
eefe426ef7 Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile() 2026-03-26 14:40:01 +00:00
Qianfeng Zhang
76da618c85 Enable run-time selection of MTile sizes according to the predicted CU utilization ratio 2026-03-20 05:36:44 +00:00
Qianfeng Zhang
302537c5a8 Update to support grouped mode hstu attention 2026-03-09 16:15:58 +00:00
Qianfeng Zhang
73d6e0eb67 Use the in-place version of block_tile_reduce() to avoid using m_local 2026-03-05 16:27:41 +00:00
Qianfeng Zhang
2be2c3cd11 Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane() 2026-02-21 14:59:00 +00:00
Qianfeng Zhang
f2a555dac7 Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts 2026-02-09 15:55:13 +00:00
Qianfeng Zhang
6f8b9548b5 Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument 2026-02-09 09:02:17 +00:00
Qianfeng Zhang
bdfa0a74c2 Update to hstu masking to separate the implementation for cross-attention and self-attention 2026-02-08 16:00:47 +00:00
Qianfeng Zhang
0711f4f90a Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention 2026-02-06 15:40:07 +00:00
Qianfeng Zhang
d169ed2194 Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950 2026-02-05 15:57:18 +00:00
Qianfeng Zhang
8af5e26717 Add softmax selection to two of the testing scripts 2026-02-05 15:27:15 +00:00