Qianfeng Zhang
|
e4e22cb2d9
|
Simplify the code for k_lds_write_windows and k_lds_read_windows in the pipelines
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
6ac36b9459
|
Tiny fix in GetQKBlockGemm
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ff54459c23
|
Enable the use of WarpTile-32x32x16 and add scripts to verify it
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
c4fc7b28c8
|
Add static_assert and comments in the with_softmax pipelines
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
c98688d5ad
|
Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
fa5b077b91
|
Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
311fb8a379
|
Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
cf2172b580
|
Use explicit partition_index to ensure warp_id is allocated in a VGPR when accessing LDS tile_window
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
45590fb8b1
|
Remove useless code in the two trload pipelines
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ec9d1fe253
|
Separate Traits from Problem while being used for defining the pipeline
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
7655ffd2fb
|
Remove the k_element_func and v_element_func from the pipeline since they are not used
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
2ebe159050
|
Update the two trload pipelines to load the whole Q-tile once through LDS on MI350
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
0253442ef9
|
Simplify the code in block_gemm_areg_bsmem_creg_v2_hack_1
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
0e1444a73c
|
Simplify the code in block_gemm_areg_bsmem_trload_creg
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
76b515e092
|
[Performance] Change the tile settings for the MI350/trload no_softmax pipeline to enable the use of mfma-16x16x32 for Gemm-1
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
f608f29401
|
Improve the softmax+trload pipeline by using kN0=64 and prefetch only two k tiles
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
7d9a95dbe0
|
Tiny fix in trload with_softmax/no_softmax pipeline
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
e7bfee6043
|
Improve both the with_softmax and no_softmax pipelines
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
3e122a551d
|
Add kUseTrLoad = false in non-trload pipeline
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
660604635d
|
Clarify the use of CK_TILE_HOST and CK_TILE_HOST_DEVICE to reduce compile time
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
3565656005
|
Change how max_uih_seqlen is updated in the example
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
639332b5de
|
Use supplement_array_by_last_element(num_targets, ) in example
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
19a887cffa
|
Use supplement_array_by_last_element() in example to simplify the code
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
9537b958ab
|
Update to README.md
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
2ae759c5b3
|
Add scripts for testing the use of separate sequence lengths for k/v
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
a926db031a
|
Support separate sequence lengths for q and kv
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
bcefc197d8
|
Use separate pipelines for the with-softmax and no-softmax cases
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
4b3901e989
|
Implementation of hstu attention pipeline using trload for v on mi350
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
5db875df2c
|
Fix in the comments
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
563229a9e4
|
Update to gemm_0's CBlockDistribution encoding so that it is compatible with gemm_1's ABlockDistribution encoding
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
4e9d4f7487
|
Use separate tile settings for the no-softmax and with-softmax hstu attention cases
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
91afdfee40
|
Update benchmark scripts to account for the use of softmax
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
be3b27edd3
|
Add support of softmax in hstu attention
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
ec4f174ac4
|
Add template parameter to gemm_0 MakeCBlockTile() as needed for defining PcompBlockTileType
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
da50eea674
|
Move scaling by attn_scale to inside the main-loop
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
3639fb8e38
|
Let IsTokenPairInsideMask() return bool type
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
a71c996049
|
Add instances to consider for adding softmax support
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
0abb52004a
|
Remove K0 from tile setting since it is not used
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
de198549ad
|
Change the pipeline so that it is easier to add softmax support
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
7e79736df7
|
Remove the use of the IGLP method for instruction scheduling on the kUseLocal true path
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
08f50c2c51
|
Fix in GetQKBlockGemm()
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
5199dbd027
|
Simplify the warp_gemm definitions in GetQKBlockGemm and GetKVBlockGemm
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
b0710e8871
|
Remove useless constant statement in the kernel
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
2c861d541f
|
Remove unnecessary HSTU_CHECK() calls
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
bc5616f1dc
|
Add HSTU_CHECK() and use it in example code
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
f7f90c539e
|
Small update in reference hstu attention
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
9c4d76d96b
|
Detach HstuBlockMask from pipeline definition and construct the HstuBlockMask type in the kernel according to window_size
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
8313b34543
|
Unify the license statements on all the source files
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
2668bb3aee
|
Remove the use of MakeKargsImpl() to simplify the hstu kernel
|
2026-06-23 09:27:58 +00:00 |
|
Qianfeng Zhang
|
a8c62920bf
|
Clarify the use of kSubQKHeaddim and kQKHeaddim so that less regular hdim (e.g. 96, 160) can be efficiently supported
|
2026-06-23 09:27:58 +00:00 |
|