Qianfeng Zhang | b9d4be0982 | Use type_convert to convert float constant to CompDataType | 2026-04-24 15:46:26 +00:00
Qianfeng Zhang | 1f2e2a272e | Implement conditional softmax rescale in trload with_softmax pipeline | 2026-04-24 09:54:12 +00:00
Qianfeng Zhang | 90e718f73d | Implement conditional softmax rescale in non-trload with_softmax pipeline | 2026-04-24 09:22:44 +00:00
Qianfeng Zhang | d099819657 | Renaming the test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh | 2026-04-24 07:36:41 +00:00
Qianfeng Zhang | 0b6bbe45d6 | Remove exposing kUseTrLoad as template parameter of pipeline problem | 2026-04-21 15:35:03 +00:00
Qianfeng Zhang | 8f0f7ca436 | Simplification in the cross_attention testing/benchmarking scripts | 2026-04-17 09:38:41 +00:00
Qianfeng Zhang | 3f9f2fa736 | Remove max_target 3200 cases from cross_attention testing and benchmarking | 2026-04-17 09:17:38 +00:00
Qianfeng Zhang | db3263469c | Clarify the using the max_seqlen and max_seqlen_q | 2026-04-17 09:13:45 +00:00
Qianfeng Zhang | 5c84f54fd9 | Add scripts for testing/benchmarking cross_attention cases | 2026-04-16 15:45:57 +00:00
Qianfeng Zhang | 7889844d6b | Clarify the using of group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for group attention example | 2026-04-16 07:11:55 +00:00
Qianfeng Zhang | 9279af33f1 | Add implementation of fwd splitkv on no_softmax path | 2026-04-16 07:11:06 +00:00
Qianfeng Zhang | a95f64601d | Remove dropout=true instances to reduce compiling-time | 2026-04-07 09:38:18 +00:00
Qianfeng Zhang | 348c3e05be | Rename default_policy to policy for hstu_attention forward | 2026-04-07 08:41:58 +00:00
Qianfeng Zhang | 423cc72bc4 | Move the calling of mask.GetTileRangeAlongX() to the kernel | 2026-03-28 14:29:17 +00:00
Qianfeng Zhang | eefe426ef7 | Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile() | 2026-03-26 14:40:01 +00:00
Qianfeng Zhang | 76da618c85 | Enable run-time selection of MTile sizes according to the predicted CU utilization ratio | 2026-03-20 05:36:44 +00:00
Qianfeng Zhang | 302537c5a8 | Update to support grouped mode hstu attention | 2026-03-09 16:15:58 +00:00
Qianfeng Zhang | 73d6e0eb67 | Using in-place version of block_tile_reduce() so that using of m_local is avoided | 2026-03-05 16:27:41 +00:00
Qianfeng Zhang | 2be2c3cd11 | Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane() | 2026-02-21 14:59:00 +00:00
Qianfeng Zhang | f2a555dac7 | Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts | 2026-02-09 15:55:13 +00:00
Qianfeng Zhang | 6f8b9548b5 | Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument | 2026-02-09 09:02:17 +00:00
Qianfeng Zhang | bdfa0a74c2 | Update to hstu masking to separate the implementation for cross-attention and self-attention | 2026-02-08 16:00:47 +00:00
Qianfeng Zhang | 0711f4f90a | Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention | 2026-02-06 15:40:07 +00:00
Qianfeng Zhang | d169ed2194 | Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950 | 2026-02-05 15:57:18 +00:00
Qianfeng Zhang | 8af5e26717 | Add softmax selection to two of the testing scripts | 2026-02-05 15:27:15 +00:00
Qianfeng Zhang | 0a8c5f523a | [Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling | 2026-02-02 15:59:38 +00:00
Qianfeng Zhang | c360e0cbc4 | Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256 | 2026-01-30 10:02:31 +00:00
Qianfeng Zhang | 749e83f2fd | Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q | 2026-01-26 13:45:42 +00:00
Qianfeng Zhang | 1d4d925ba3 | Fix in K-LdsBuffer and V-LdsBuffer over-lap checking | 2025-12-27 05:43:11 +00:00
Qianfeng Zhang | d2dadc22a7 | Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop | 2025-12-26 15:38:52 +00:00
Qianfeng Zhang | df902c6a06 | Tiny fix in using v_tiles[] index | 2025-12-25 15:37:22 +00:00
Qianfeng Zhang | 2d53d67b6d | Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving | 2025-12-25 14:58:09 +00:00
Qianfeng Zhang | ddf0f1c8ed | Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving | 2025-12-25 14:30:57 +00:00
Qianfeng Zhang | 02cae85af5 | Load Q directly from global memory to registers for BlockGemm | 2025-12-20 14:08:55 +00:00
Qianfeng Zhang | 3d90b5f90e | Remove un-used including from default policy file | 2025-12-19 10:13:41 +00:00
Qianfeng Zhang | 9e47664092 | Move common codes to detail namespace from Problem class scope | 2025-12-17 10:37:21 +00:00
Qianfeng Zhang | 89daa890d1 | Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f) | 2025-12-17 07:47:17 +00:00
Qianfeng Zhang | 1cf868026b | Add support of loading QK tiles of hdim96 without padding to hdim128 | 2025-12-16 16:39:40 +00:00
Qianfeng Zhang | 588f573ee1 | Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile | 2025-12-16 15:03:57 +00:00
Qianfeng Zhang | 179f0e857e | Rename WarpTile in fwd setting | 2025-12-14 16:40:52 +00:00
Qianfeng Zhang | 125934a966 | Simplifying the codes in defining KDram and QDram tile distribution | 2025-12-14 14:23:56 +00:00
Qianfeng Zhang | 1ab5e9da93 | Tiny update in GetMaxVectorSize() | 2025-12-14 04:43:02 +00:00
Qianfeng Zhang | f79a29ac80 | Rename and add scripts for testing hdim96 | 2025-12-12 16:16:43 +00:00
Qianfeng Zhang | b3d54477f1 | Enable hdim96 instances | 2025-12-12 16:16:23 +00:00
Qianfeng Zhang | 18108d0d54 | Fix with regard to define stride in MakeKLdsBlockDescriptor() | 2025-12-12 09:55:53 +00:00
Qianfeng Zhang | db39b44bab | Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV | 2025-12-11 10:47:54 +00:00
Qianfeng Zhang | 8640ffe8eb | Further correction with regard to using n0_loops and k1_loops | 2025-12-08 16:03:56 +00:00
Qianfeng Zhang | 641dae10e8 | Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning | 2025-12-08 13:07:42 +00:00
Qianfeng Zhang | 3a89eb8857 | Simplify the codes in block_gemm | 2025-12-06 15:45:38 +00:00
Qianfeng Zhang | 4731c8e519 | Further clarification in using kSubQKHeaddim and kQKHeaddim | 2025-12-03 09:46:44 +00:00