Qianfeng Zhang
|
c71cdbcad7
|
Some renaming in kernel and pipeline
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
1304e807fb
|
Update and fix for leaked changes and enable the scripts to test/benchmark kStoreLSE cases
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
798fd3cd8b
|
Enable the kernel dispatching path from is_training & use_softmax to kStoreLSE
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
8b62d651a4
|
Add instances and kStoreLSE template in dispatcher class to support outputting lse for fwd training
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
c1e3b9be6a
|
Set lse tensor dim strides in example
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
0fdca0e940
|
Replace template kUseSoftmax/kStoreLSE by boolean parameters in reference fwd codes to save compile time
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
8f091efc4c
|
Add support for outputting lse in the example and reference hstu attention forward implementation
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
cc184fc202
|
Add support for preparing lse_dram_window in hstu fwd kernel
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
75a3b5aab0
|
Make the kernel use types declared in the problem rather than the pipeline
|
2026-06-05 15:51:39 +00:00 |
|
Qianfeng Zhang
|
696d4534b2
|
Tiny simplification in defining the Bias-related Kargs
|
2026-06-03 09:55:04 +00:00 |
|
Qianfeng Zhang
|
eba3c2f635
|
Add parameters used for storing lse in the fwd and fwd_splitkv_combine kernels to prepare for supporting training
|
2026-06-03 09:55:04 +00:00 |
|
Qianfeng Zhang
|
5ee8a37cd3
|
Move num_splits/o_acc_ptr/l_acc_ptr out of the HstuAttention<xxx>FwdParams struct
|
2026-06-03 09:55:04 +00:00 |
|
Qianfeng Zhang
|
36dd77fb16
|
Add kStoreLSE template parameter to the problems
|
2026-06-03 09:55:04 +00:00 |
|
Qianfeng Zhang
|
333abddbae
|
Rename the reference interfaces and the files
|
2026-05-28 08:07:54 +00:00 |
|
Qianfeng Zhang
|
e841981ddd
|
Update MakeLSEaccDramTileDistribution to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() works on fewer KThreadPerWarp
|
2026-05-23 07:24:00 +00:00 |
|
Qianfeng Zhang
|
1dbd127d1b
|
Use buffer_view to create lse_acc_dram_naive so that the out-of-boundary load value can be specified (as -inf)
|
2026-05-23 04:37:52 +00:00 |
|
Qianfeng Zhang
|
86d8d72008
|
Use partition_index parameter for all get_x_indices_from_distributed_indices() calls
|
2026-05-22 15:08:30 +00:00 |
|
Qianfeng Zhang
|
65992be728
|
Update to the cross_attention test/bench scripts
|
2026-05-22 14:44:14 +00:00 |
|
Qianfeng Zhang
|
b1052e87e1
|
Add implementation of hstu fwd splitkv for softmax path
|
2026-05-22 09:54:27 +00:00 |
|
Qianfeng Zhang
|
8a7529177d
|
Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack
|
2026-05-11 08:53:07 +00:00 |
|
Qianfeng Zhang
|
0a32eddc0a
|
Re-format the .hpp/.cpp files using clang-format-18
|
2026-05-10 13:46:02 +00:00 |
|
Qianfeng Zhang
|
6981f148ee
|
Fix potential bug in kernel host interface BlockSize()
|
2026-05-08 06:27:19 -04:00 |
|
Qianfeng Zhang
|
250f325c3a
|
More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy
|
2026-05-07 06:32:01 -04:00 |
|
Qianfeng Zhang
|
888b6cad86
|
Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950
|
2026-04-30 14:06:06 +00:00 |
|
Qianfeng Zhang
|
4c583f0574
|
Add -fno-slp-vectorize option for building hstu kernels on gfx950
|
2026-04-30 13:37:22 +00:00 |
|
Qianfeng Zhang
|
7883f52d9f
|
Use include <...> format to refer to header files from ck_tile
|
2026-04-27 10:15:14 +00:00 |
|
Qianfeng Zhang
|
d0803f263d
|
Mark low probability branch as unlikely in the softmax pipelines
|
2026-04-27 07:24:03 +00:00 |
|
Qianfeng Zhang
|
b9d4be0982
|
Use type_convert to convert float constant to CompDataType
|
2026-04-24 15:46:26 +00:00 |
|
Qianfeng Zhang
|
1f2e2a272e
|
Implement conditional softmax rescale in trload with_softmax pipeline
|
2026-04-24 09:54:12 +00:00 |
|
Qianfeng Zhang
|
90e718f73d
|
Implement conditional softmax rescale in non-trload with_softmax pipeline
|
2026-04-24 09:22:44 +00:00 |
|
Qianfeng Zhang
|
d099819657
|
Rename test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh
|
2026-04-24 07:36:41 +00:00 |
|
Qianfeng Zhang
|
0b6bbe45d6
|
Remove exposing kUseTrLoad as template parameter of pipeline problem
|
2026-04-21 15:35:03 +00:00 |
|
Qianfeng Zhang
|
8f0f7ca436
|
Simplification in the cross_attention testing/benchmarking scripts
|
2026-04-17 09:38:41 +00:00 |
|
Qianfeng Zhang
|
3f9f2fa736
|
Remove max_target 3200 cases from cross_attention testing and benchmarking
|
2026-04-17 09:17:38 +00:00 |
|
Qianfeng Zhang
|
db3263469c
|
Clarify the use of max_seqlen and max_seqlen_q
|
2026-04-17 09:13:45 +00:00 |
|
Qianfeng Zhang
|
5c84f54fd9
|
Add scripts for testing/benchmarking cross_attention cases
|
2026-04-16 15:45:57 +00:00 |
|
Qianfeng Zhang
|
7889844d6b
|
Clarify the use of the group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for the group attention example
|
2026-04-16 07:11:55 +00:00 |
|
Qianfeng Zhang
|
9279af33f1
|
Add implementation of fwd splitkv on no_softmax path
|
2026-04-16 07:11:06 +00:00 |
|
Qianfeng Zhang
|
a95f64601d
|
Remove dropout=true instances to reduce compile time
|
2026-04-07 09:38:18 +00:00 |
|
Qianfeng Zhang
|
348c3e05be
|
Rename default_policy to policy for hstu_attention forward
|
2026-04-07 08:41:58 +00:00 |
|
Qianfeng Zhang
|
423cc72bc4
|
Move the calling of mask.GetTileRangeAlongX() to the kernel
|
2026-03-28 14:29:17 +00:00 |
|
Qianfeng Zhang
|
eefe426ef7
|
Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()
|
2026-03-26 14:40:01 +00:00 |
|
Qianfeng Zhang
|
76da618c85
|
Enable run-time selection of MTile sizes according to the predicted CU utilization ratio
|
2026-03-20 05:36:44 +00:00 |
|
Qianfeng Zhang
|
302537c5a8
|
Update to support grouped mode hstu attention
|
2026-03-09 16:15:58 +00:00 |
|
Qianfeng Zhang
|
73d6e0eb67
|
Use the in-place version of block_tile_reduce() so that m_local is no longer needed
|
2026-03-05 16:27:41 +00:00 |
|
Qianfeng Zhang
|
2be2c3cd11
|
Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()
|
2026-02-21 14:59:00 +00:00 |
|
Qianfeng Zhang
|
f2a555dac7
|
Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts
|
2026-02-09 15:55:13 +00:00 |
|
Qianfeng Zhang
|
6f8b9548b5
|
Use kIsCrossAttention as a Problem attribute instead of passing is_cross_attention as a kernel argument
|
2026-02-09 09:02:17 +00:00 |
|
Qianfeng Zhang
|
bdfa0a74c2
|
Update to hstu masking to separate the implementation for cross-attention and self-attention
|
2026-02-08 16:00:47 +00:00 |
|
Qianfeng Zhang
|
0711f4f90a
|
Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention
|
2026-02-06 15:40:07 +00:00 |
|