Qianfeng Zhang
|
5ee8a37cd3
|
Move num_splits/o_acc_ptr/l_acc_ptr out from HstuAttention<xxx>FwdParams struct
|
2026-06-03 09:55:04 +00:00 |
|
Qianfeng Zhang
|
36dd77fb16
|
Add kStoreLSE template parameter to the problems
|
2026-06-03 09:55:04 +00:00 |
|
Qianfeng Zhang
|
333abddbae
|
Rename the reference interfaces and the files
|
2026-05-28 08:07:54 +00:00 |
|
Qianfeng Zhang
|
e841981ddd
|
Update MakeLSEaccDramTileDistribution to try to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() works on fewer KThreadPerWarp
|
2026-05-23 07:24:00 +00:00 |
|
Qianfeng Zhang
|
1dbd127d1b
|
Use buffer_view to create lse_acc_dram_naive so that the out_of_boundary loading value can be specified (as -inf)
|
2026-05-23 04:37:52 +00:00 |
|
Qianfeng Zhang
|
86d8d72008
|
Use partition_index parameter for all get_x_indices_from_distributed_indices() calls
|
2026-05-22 15:08:30 +00:00 |
|
Qianfeng Zhang
|
65992be728
|
Update to the cross_attention test/bench scripts
|
2026-05-22 14:44:14 +00:00 |
|
Qianfeng Zhang
|
b1052e87e1
|
Add implementation of hstu fwd splitkv for softmax path
|
2026-05-22 09:54:27 +00:00 |
|
Qianfeng Zhang
|
8a7529177d
|
Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack
|
2026-05-11 08:53:07 +00:00 |
|
Qianfeng Zhang
|
0a32eddc0a
|
Re-format the .hpp/.cpp files using clang-format-18
|
2026-05-10 13:46:02 +00:00 |
|
Qianfeng Zhang
|
6981f148ee
|
Fix potential bug in kernel host interface BlockSize()
|
2026-05-08 06:27:19 -04:00 |
|
Qianfeng Zhang
|
250f325c3a
|
More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy
|
2026-05-07 06:32:01 -04:00 |
|
Qianfeng Zhang
|
888b6cad86
|
Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950
|
2026-04-30 14:06:06 +00:00 |
|
Qianfeng Zhang
|
4c583f0574
|
Add -fno-slp-vectorize option for building hstu kernels on gfx950
|
2026-04-30 13:37:22 +00:00 |
|
Qianfeng Zhang
|
7883f52d9f
|
Use include <...> format to refer to header files from ck_tile
|
2026-04-27 10:15:14 +00:00 |
|
Qianfeng Zhang
|
d0803f263d
|
Mark low probability branch as unlikely in the softmax pipelines
|
2026-04-27 07:24:03 +00:00 |
|
Qianfeng Zhang
|
b9d4be0982
|
Use type_convert to convert float constant to CompDataType
|
2026-04-24 15:46:26 +00:00 |
|
Qianfeng Zhang
|
1f2e2a272e
|
Implement conditional softmax rescale in trload with_softmax pipeline
|
2026-04-24 09:54:12 +00:00 |
|
Qianfeng Zhang
|
90e718f73d
|
Implement conditional softmax rescale in non-trload with_softmax pipeline
|
2026-04-24 09:22:44 +00:00 |
|
Qianfeng Zhang
|
d099819657
|
Rename test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh
|
2026-04-24 07:36:41 +00:00 |
|
Qianfeng Zhang
|
0b6bbe45d6
|
Remove exposing kUseTrLoad as template parameter of pipeline problem
|
2026-04-21 15:35:03 +00:00 |
|
Qianfeng Zhang
|
8f0f7ca436
|
Simplification in the cross_attention testing/benchmarking scripts
|
2026-04-17 09:38:41 +00:00 |
|
Qianfeng Zhang
|
3f9f2fa736
|
Remove max_target 3200 cases from cross_attention testing and benchmarking
|
2026-04-17 09:17:38 +00:00 |
|
Qianfeng Zhang
|
db3263469c
|
Clarify the use of max_seqlen and max_seqlen_q
|
2026-04-17 09:13:45 +00:00 |
|
Qianfeng Zhang
|
5c84f54fd9
|
Add scripts for testing/benchmarking cross_attention cases
|
2026-04-16 15:45:57 +00:00 |
|
Qianfeng Zhang
|
7889844d6b
|
Clarify the use of the group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for the group attention example
|
2026-04-16 07:11:55 +00:00 |
|
Qianfeng Zhang
|
9279af33f1
|
Add implementation of fwd splitkv on no_softmax path
|
2026-04-16 07:11:06 +00:00 |
|
Qianfeng Zhang
|
a95f64601d
|
Remove dropout=true instances to reduce compile time
|
2026-04-07 09:38:18 +00:00 |
|
Qianfeng Zhang
|
348c3e05be
|
Rename default_policy to policy for hstu_attention forward
|
2026-04-07 08:41:58 +00:00 |
|
Qianfeng Zhang
|
423cc72bc4
|
Move the calling of mask.GetTileRangeAlongX() to the kernel
|
2026-03-28 14:29:17 +00:00 |
|
Qianfeng Zhang
|
eefe426ef7
|
Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()
|
2026-03-26 14:40:01 +00:00 |
|
Qianfeng Zhang
|
76da618c85
|
Enable run-time selection of MTile sizes according to the predicted CU utilization ratio
|
2026-03-20 05:36:44 +00:00 |
|
Qianfeng Zhang
|
302537c5a8
|
Update to support grouped mode hstu attention
|
2026-03-09 16:15:58 +00:00 |
|
Qianfeng Zhang
|
73d6e0eb67
|
Use the in-place version of block_tile_reduce() so that the use of m_local is avoided
|
2026-03-05 16:27:41 +00:00 |
|
Qianfeng Zhang
|
2be2c3cd11
|
Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()
|
2026-02-21 14:59:00 +00:00 |
|
Qianfeng Zhang
|
f2a555dac7
|
Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts
|
2026-02-09 15:55:13 +00:00 |
|
Qianfeng Zhang
|
6f8b9548b5
|
Use kIsCrossAttention as a Problem attribute instead of passing is_cross_attention as a kernel argument
|
2026-02-09 09:02:17 +00:00 |
|
Qianfeng Zhang
|
bdfa0a74c2
|
Update hstu masking to separate the implementations for cross-attention and self-attention
|
2026-02-08 16:00:47 +00:00 |
|
Qianfeng Zhang
|
0711f4f90a
|
Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention
|
2026-02-06 15:40:07 +00:00 |
|
Qianfeng Zhang
|
d169ed2194
|
Change the tile setting to use mfma-32x32x16 for the WithSoftmax pipeline on gfx950
|
2026-02-05 15:57:18 +00:00 |
|
Qianfeng Zhang
|
8af5e26717
|
Add softmax selection to two of the testing scripts
|
2026-02-05 15:27:15 +00:00 |
|
Qianfeng Zhang
|
0a8c5f523a
|
[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling
|
2026-02-02 15:59:38 +00:00 |
|
Qianfeng Zhang
|
c360e0cbc4
|
Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256
|
2026-01-30 10:02:31 +00:00 |
|
Qianfeng Zhang
|
749e83f2fd
|
Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q
|
2026-01-26 13:45:42 +00:00 |
|
Qianfeng Zhang
|
1d4d925ba3
|
Fix the K-LdsBuffer and V-LdsBuffer overlap checking
|
2025-12-27 05:43:11 +00:00 |
|
Qianfeng Zhang
|
d2dadc22a7
|
Remove unneeded constexpr checking for loading v_tiles in Gemm0 loop
|
2025-12-26 15:38:52 +00:00 |
|
Qianfeng Zhang
|
df902c6a06
|
Tiny fix in using v_tiles[] index
|
2025-12-25 15:37:22 +00:00 |
|
Qianfeng Zhang
|
2d53d67b6d
|
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving
|
2025-12-25 14:58:09 +00:00 |
|
Qianfeng Zhang
|
ddf0f1c8ed
|
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving
|
2025-12-25 14:30:57 +00:00 |
|
Qianfeng Zhang
|
02cae85af5
|
Load Q directly from global memory to registers for BlockGemm
|
2025-12-20 14:08:55 +00:00 |
|