Qianfeng Zhang
423cc72bc4
Move the call to mask.GetTileRangeAlongX() into the kernel
2026-03-28 14:29:17 +00:00
Qianfeng Zhang
eefe426ef7
Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()
2026-03-26 14:40:01 +00:00
Qianfeng Zhang
76da618c85
Enable run-time selection of MTile sizes according to the predicted CU utilization ratio
2026-03-20 05:36:44 +00:00
Qianfeng Zhang
302537c5a8
Update to support grouped mode hstu attention
2026-03-09 16:15:58 +00:00
Qianfeng Zhang
73d6e0eb67
Use the in-place version of block_tile_reduce() to avoid using m_local
2026-03-05 16:27:41 +00:00
Qianfeng Zhang
2be2c3cd11
Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()
2026-02-21 14:59:00 +00:00
Qianfeng Zhang
f2a555dac7
Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts
2026-02-09 15:55:13 +00:00
Qianfeng Zhang
6f8b9548b5
Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument
2026-02-09 09:02:17 +00:00
Qianfeng Zhang
bdfa0a74c2
Update hstu masking to separate the implementations for cross-attention and self-attention
2026-02-08 16:00:47 +00:00
Qianfeng Zhang
0711f4f90a
Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention
2026-02-06 15:40:07 +00:00
Qianfeng Zhang
d169ed2194
Change the tile setting to use mfma-32x32x16 for the WithSoftmax pipeline on gfx950
2026-02-05 15:57:18 +00:00
Qianfeng Zhang
8af5e26717
Add softmax selection to two of the testing scripts
2026-02-05 15:27:15 +00:00
Qianfeng Zhang
0a8c5f523a
[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling
2026-02-02 15:59:38 +00:00
Qianfeng Zhang
c360e0cbc4
Add scripts for benchmarking sparsity 0.9 cases with mattn256 & full256
2026-01-30 10:02:31 +00:00
Qianfeng Zhang
749e83f2fd
Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q
2026-01-26 13:45:42 +00:00
Qianfeng Zhang
1d4d925ba3
Fix K-LdsBuffer and V-LdsBuffer overlap checking
2025-12-27 05:43:11 +00:00
Qianfeng Zhang
d2dadc22a7
Remove unneeded constexpr checking for loading v_tiles in the Gemm0 loop
2025-12-26 15:38:52 +00:00
Qianfeng Zhang
df902c6a06
Tiny fix in using v_tiles[] index
2025-12-25 15:37:22 +00:00
Qianfeng Zhang
2d53d67b6d
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving
2025-12-25 14:58:09 +00:00
Qianfeng Zhang
ddf0f1c8ed
Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving
2025-12-25 14:30:57 +00:00
Qianfeng Zhang
02cae85af5
Load Q directly from global memory to registers for BlockGemm
2025-12-20 14:08:55 +00:00
Qianfeng Zhang
3d90b5f90e
Remove unused include from the default policy file
2025-12-19 10:13:41 +00:00
Qianfeng Zhang
9e47664092
Move common code from the Problem class scope to the detail namespace
2025-12-17 10:37:21 +00:00
Qianfeng Zhang
89daa890d1
Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f)
2025-12-17 07:47:17 +00:00
Qianfeng Zhang
1cf868026b
Add support of loading QK tiles of hdim96 without padding to hdim128
2025-12-16 16:39:40 +00:00
Qianfeng Zhang
588f573ee1
Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile
2025-12-16 15:03:57 +00:00
Qianfeng Zhang
179f0e857e
Rename WarpTile in fwd setting
2025-12-14 16:40:52 +00:00
Qianfeng Zhang
125934a966
Simplify the code defining the KDram and QDram tile distributions
2025-12-14 14:23:56 +00:00
Qianfeng Zhang
1ab5e9da93
Tiny update in GetMaxVectorSize()
2025-12-14 04:43:02 +00:00
Qianfeng Zhang
f79a29ac80
Rename and add scripts for testing hdim96
2025-12-12 16:16:43 +00:00
Qianfeng Zhang
b3d54477f1
Enable hdim96 instances
2025-12-12 16:16:23 +00:00
Qianfeng Zhang
18108d0d54
Fix the stride definition in MakeKLdsBlockDescriptor()
2025-12-12 09:55:53 +00:00
Qianfeng Zhang
db39b44bab
Update the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV
2025-12-11 10:47:54 +00:00
Qianfeng Zhang
8640ffe8eb
Further correction with regard to using n0_loops and k1_loops
2025-12-08 16:03:56 +00:00
Qianfeng Zhang
641dae10e8
Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning
2025-12-08 13:07:42 +00:00
Qianfeng Zhang
3a89eb8857
Simplify the code in block_gemm
2025-12-06 15:45:38 +00:00
Qianfeng Zhang
4731c8e519
Further clarify the use of kSubQKHeaddim and kQKHeaddim
2025-12-03 09:46:44 +00:00
Qianfeng Zhang
2549bc1fee
Clarify the use of kSubQKHeaddim and kQKHeaddim
2025-12-03 08:57:57 +00:00
Qianfeng Zhang
7234b2fc1a
Simplify the code for k_lds_write_windows and k_lds_read_windows in the pipelines
2025-12-01 14:58:02 +00:00
Qianfeng Zhang
c1817464be
Tiny fix in GetQKBlockGemm
2025-11-30 14:04:48 +00:00
Qianfeng Zhang
f01e0ef37d
Enable the use of WarpTile-32x32x16 and add scripts to verify
2025-11-30 04:58:28 +00:00
Qianfeng Zhang
d99493606e
Add static_assert and comments in the with_softmax pipelines
2025-11-28 15:19:33 +00:00
Qianfeng Zhang
f952d3571c
Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950
2025-11-28 14:02:16 +00:00
Qianfeng Zhang
a0e4315d4e
Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline
2025-11-27 15:30:53 +00:00
Qianfeng Zhang
69c97c06d7
Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md
2025-11-27 08:27:52 +00:00
Qianfeng Zhang
f9e8c5539f
Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window
2025-11-23 04:49:01 +00:00
Qianfeng Zhang
4f33eb5857
Merge branch 'develop' into hstu_attention_mi350_fwd_bwd
2025-11-23 04:20:53 +00:00
Emily Martins
02ab76c2cb
Fix CK Tile DP + 2 Tile Stream-K Validation Errors (#3269)
When multiple workgroups contribute to a tile using atomics, there may
be round-off error in cases where the accumulator type is not the same
as the C type. To compute an error tolerance for
test validation, the Stream-K Tile Partitioner has a function called
estimate_num_wgs_per_tile to estimate the number of workgroups per tile.
That said, this function only provides an estimate. In some cases for
DP+2TSK, the function returns 1 rather than the more accurate value of
2.
Thus, this change updates the estimate_num_wgs_per_tile function to
explicitly return the value of 2 in cases for DP+2TSK to ensure that we
have a better error tolerance to avoid test failures due to round-off
error.
2025-11-21 20:29:47 -07:00
Illia Silin
21ae743acd
Enable daily builds on gfx1010 (#3258)
* add build/test on gfx1010
* only build and run on gfx1010 once daily
2025-11-21 07:22:01 -08:00
John Shumway
ea6e4fcbbc
Fix builder errors. (#3260)
There were four errors to fix:
1. The checks for defaulted direction were not implemented in the predicate concept.
2. Had to delete an obsolete and undefined operation enum.
3. A factory was passing a boolean in place of an integer.
4. Some of the factory tests are not compiling correctly when linking in the full source (with CK_EXPERIMENTAL_BUILDER=ON), so I commented them out.
2025-11-21 15:25:45 +01:00