Qianfeng Zhang
|
ee5bd0ebba
|
Tiny simplification in defining the Bias-related Kargs
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
f41b0176d3
|
Add parameters used for storing lse in the fwd and fwd_splitkv_combine kernels to prepare for supporting training
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
270d073c88
|
Move num_splits/o_acc_ptr/l_acc_ptr out of the HstuAttention<xxx>FwdParams struct
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
b2561b88e4
|
Add kStoreLSE template parameter to the problems
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
2a86bfb6f5
|
Implement host reference operator for hstu attention backward
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
ec58f92f05
|
Rename the reference interfaces and the files
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
7d317adf37
|
Update MakeLSEaccDramTileDistribution to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() works on fewer KThreadPerWarp
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
30b5d7bd01
|
Use buffer_view to create lse_acc_dram_naive so that the value loaded for out-of-boundary accesses can be specified (as -inf)
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
9a7cc5b4a3
|
Use partition_index parameter for all get_x_indices_from_distributed_indices() calls
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
9fbe96ab76
|
Update to the cross_attention test/bench scripts
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
24329f15d1
|
Add implementation of hstu fwd splitkv for softmax path
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d2bc927242
|
Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ea5af27b62
|
Re-format the .hpp/.cpp files using clang-format-18
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
bd8a87301b
|
Fix potential bug in kernel host interface BlockSize()
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f73d7d2f8a
|
More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
e48bcff488
|
Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
c6dfe030d0
|
Add -fno-slp-vectorize option for building hstu kernels on gfx950
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
1f4319ce91
|
Use include <...> format to refer to header files from ck_tile
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
a97c7a75ce
|
Mark low probability branch as unlikely in the softmax pipelines
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
68bbcac775
|
Use type_convert to convert float constant to CompDataType
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d243b275da
|
Implement conditional softmax rescale in trload with_softmax pipeline
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
671c65e178
|
Implement conditional softmax rescale in non-trload with_softmax pipeline
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
cfbd8a342a
|
Rename test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
62bf2296c6
|
Remove exposing kUseTrLoad as template parameter of pipeline problem
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
67f9461b42
|
Simplification in the cross_attention testing/benchmarking scripts
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
bd46155431
|
Remove max_target 3200 cases from cross_attention testing and benchmarking
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f99ed6225b
|
Clarify the use of max_seqlen and max_seqlen_q
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
6f2a73b17d
|
Add scripts for testing/benchmarking cross_attention cases
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
8a7f0a8e99
|
Clarify the use of the group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for the group attention example
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
c0922a6cb8
|
Add implementation of fwd splitkv on no_softmax path
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
410f472a33
|
Remove dropout=true instances to reduce compile time
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
686125c0cd
|
Rename default_policy to policy for hstu_attention forward
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
27d01448d0
|
Move the calling of mask.GetTileRangeAlongX() to the kernel
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f99d6c4112
|
Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ac9a142e63
|
Enable run-time selection of MTile sizes according to the predicted CU utilization ratio
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
0462d44215
|
Update to support grouped mode hstu attention
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
4781582b0f
|
Use the in-place version of block_tile_reduce() so that use of m_local is avoided
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d206045c53
|
Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
57d837977b
|
Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
09656528a6
|
Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d55d6a19a7
|
Update to hstu masking to separate the implementation for cross-attention and self-attention
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
3b674ee8c9
|
Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
d8f7e5a791
|
Change the tile setting to use mfma-32x32x16 for the WithSoftmax pipeline on gfx950
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
7d110e3872
|
Add softmax selection to two of the testing scripts
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
1e199d0641
|
[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
24d6e49323
|
Add scripts for benchmarking sparsity 0.9 cases with mattn256 & full256
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ea9e4d8e00
|
Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
cf971db352
|
Fix the K-LdsBuffer and V-LdsBuffer overlap checking
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
bcedecce4f
|
Remove unneeded constexpr checking for loading v_tiles in the Gemm0 loop
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
500f2245f7
|
Tiny fix in the use of the v_tiles[] index
|
2026-06-23 09:27:59 +00:00 |
|