601 Commits

Author SHA1 Message Date
Qianfeng Zhang
2bb59dff0a Add two scripts 2025-06-06 08:15:25 +00:00
Qianfeng Zhang
9582ae2dff Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes 2025-05-31 06:22:54 +00:00
Qianfeng Zhang
bec35abafe Tune the settings for hdim-256 2025-05-30 08:51:49 +00:00
Qianfeng Zhang
832747c58d Add example parameter alpha to ease the testing 2025-05-30 08:51:24 +00:00
Qianfeng Zhang
781cba355a Convert P to fp16/bf16 before doing second gemm in reference hstu implementation 2025-05-29 01:07:04 +00:00
Qianfeng Zhang
36a0f2020c not-critical updates in example and block_masking codes 2025-05-29 01:06:32 +00:00
Qianfeng Zhang
68a5ab8ff8 Add init_qkv and dump_output example parameters for easier debugging 2025-05-29 01:05:49 +00:00
Qianfeng Zhang
10c35125d2 Add example parameter max_seqlen and max_target 2025-05-27 14:43:43 +00:00
Qianfeng Zhang
c9e19351c7 Update to the method for calculating max_seqlen in the example 2025-05-27 10:38:52 +00:00
Qianfeng Zhang
dc0977faad Use NRepetitions2DEpilogue for outputting o_acc tile 2025-05-26 14:43:30 +00:00
Qianfeng Zhang
81f7b139e0 Use LDS to in-directly load Q-tile to enable dwordx4 loading and avoid cachelines wasting 2025-05-22 09:49:11 +00:00
Qianfeng Zhang
a1346aaf3e Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm 2025-05-20 07:53:25 +00:00
Qianfeng Zhang
0a8ea6bd02 Adjust the threshold values for fp16/bf16 in the example 2025-05-20 07:52:37 +00:00
Qianfeng Zhang
29cf1610f1 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt 2025-05-20 07:52:01 +00:00
Qianfeng Zhang
fac03abbda Change do-while main-loop to while-do and remove early exiting check 2025-05-19 15:39:31 +00:00
Qianfeng Zhang
14ab6f154d Adjust the codes before the main-loop 2025-05-19 13:59:57 +00:00
Qianfeng Zhang
f411d676f2 Move k_tile loading and v_tile loading earlier in the loop 2025-05-19 13:59:24 +00:00
Qianfeng Zhang
902b1c645c Move k_tile loading in the loop earlier 2025-05-19 13:58:43 +00:00
Qianfeng Zhang
f582c21418 Replace s_acc and pcomp tile array by single tile object for simplification 2025-05-19 13:57:11 +00:00
Qianfeng Zhang
4e65469fe8 Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler for better code isolation 2025-05-18 16:20:47 +00:00
Qianfeng Zhang
e4e70f8b0a Set the block_per_cu to 3 for hdim-128 2025-05-18 15:58:57 +00:00
Qianfeng Zhang
ff3415d97d Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 2025-05-18 15:26:31 +00:00
Qianfeng Zhang
694295a9d3 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:43 +00:00
Qianfeng Zhang
afd7793e92 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:05 +00:00
Qianfeng Zhang
7c0ac51b4b Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 2025-05-18 15:24:17 +00:00
Qianfeng Zhang
473fbc374b Rename the hacked block_gemm_areg_bsmem_creg_v2 2025-05-18 15:23:40 +00:00
Qianfeng Zhang
58e45ec53a Move the lambda for dividing by max_seqlen from kernel to pipeline 2025-05-18 07:58:52 +00:00
Qianfeng Zhang
0771390a28 Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop 2025-05-18 03:41:31 +00:00
Qianfeng Zhang
586968785a Set example option -save_mask default to 0 2025-05-14 13:56:54 +00:00
Qianfeng Zhang
b0d3704390 Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask 2025-05-14 02:00:22 +00:00
Qianfeng Zhang
5b0a2618fd Add -save_mask option to the example to output int8 mask tensor 2025-05-14 01:54:45 +00:00
Qianfeng Zhang
c3761c3bd6 Update the rules of hstu masking 2025-05-13 10:37:19 +00:00
Qianfeng Zhang
3a320bcdf6 Add test cases for better functional verification 2025-05-10 16:04:06 +00:00
Qianfeng Zhang
79cd1f0653 Fix sequence dim length for o_dram descriptor in the kernel 2025-05-10 16:02:52 +00:00
Qianfeng Zhang
1d1dd8f1eb Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time" (this reverts commit 2972de4c88) 2025-05-07 13:37:15 +00:00
Qianfeng Zhang
d32851e15c Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack 2025-05-07 10:09:23 +00:00
Qianfeng Zhang
632fd06a7a Use kK1=16 2025-05-07 09:51:16 +00:00
Qianfeng Zhang
079f7e3a03 Use type_convert rather than static_cast in f_silu 2025-05-07 07:11:54 +00:00
Qianfeng Zhang
72d55d1b40 Add max_seqlen as divider in SiLU 2025-05-06 16:18:52 +00:00
Qianfeng Zhang
374e0626e6 Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention 2025-05-06 08:24:03 +00:00
Qianfeng Zhang
611f2ce1f9 Override and fix GetAlignmentK() 2025-05-03 16:23:54 +00:00
Qianfeng Zhang
da89540ee0 Use kN0=32 2025-04-30 05:42:43 +00:00
Qianfeng Zhang
2972de4c88 Temporarily close the instance for hdim64 and hdim256 to save compiling time 2025-04-30 02:20:41 +00:00
Qianfeng Zhang
d63dab90da Hack block_gemm_areg_bsmem_creg_v2 to let s_acc for gemm_0 not need be cleared first 2025-04-28 15:24:13 +00:00
Qianfeng Zhang
f1f4e249a6 Adjust the v_tile and k_tile loading location 2025-04-28 09:25:09 +00:00
Qianfeng Zhang
f53be61a74 Put two gemms call inside one n0loop unroll 2025-04-28 06:41:37 +00:00
Qianfeng Zhang
1af27022ef Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausl=true but kUseLocal=false 2025-04-27 09:31:38 +00:00
Qianfeng Zhang
054c397e05 Replace set_tile_if() by sweep_tile_span() to reduce branching 2025-04-27 05:00:09 +00:00
Qianfeng Zhang
95c93ba92e Update the GridSize() and GetTileIndex() in hstu kernel 2025-04-26 10:01:23 +00:00
Qianfeng Zhang
1b463e915d Add scripts for measuring jagged with/no causal cases 2025-04-25 15:59:51 +00:00