Commit Graph

228 Commits

Author SHA1 Message Date
Qianfeng Zhang
a5f24d7470 Change while() do to do while() for the main loop to let the compiler generate more elegant code 2025-06-21 13:51:42 +00:00
Qianfeng Zhang
f9caae2d8b Use batch dim as first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN by ASSUME_HIGHLY_VARIED_SEQLEN 2025-06-18 16:02:34 +00:00
Qianfeng Zhang
09ac14604c Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks 2025-06-18 16:02:04 +00:00
root
9e62359b59 Tiny fix in hstu attention IsFullTileInsideMask() 2025-06-18 16:01:40 +00:00
Qianfeng Zhang
08886e99d5 Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control 2025-06-10 15:44:29 +00:00
Qianfeng Zhang
4632d30cc0 Improve the VDramTileDistribution and VLds layout for better device loading and reduce bank-conflict 2025-06-08 13:25:57 +00:00
Qianfeng Zhang
84eb9adc71 Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility 2025-06-07 15:01:19 +00:00
Qianfeng Zhang
b2db644dcd Add assert(contextual_seqlen >= 0) in example 2025-06-06 15:56:04 +00:00
Qianfeng Zhang
d7930cd541 Update IsFullTileInsideMask() for the situation where kUseLocal is true 2025-06-06 15:55:26 +00:00
Qianfeng Zhang
9e6a24010a Move all test and bench scripts to folder scripts 2025-06-06 08:22:38 +00:00
Qianfeng Zhang
2bb59dff0a Add two scripts 2025-06-06 08:15:25 +00:00
Qianfeng Zhang
9582ae2dff Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes 2025-05-31 06:22:54 +00:00
Qianfeng Zhang
bec35abafe Tune the settings for hdim-256 2025-05-30 08:51:49 +00:00
Qianfeng Zhang
832747c58d Add example parameter alpha to ease the testing 2025-05-30 08:51:24 +00:00
Qianfeng Zhang
781cba355a Convert P to fp16/bf16 before doing second gemm in reference hstu implementation 2025-05-29 01:07:04 +00:00
Qianfeng Zhang
36a0f2020c not-critical updates in example and block_masking codes 2025-05-29 01:06:32 +00:00
Qianfeng Zhang
68a5ab8ff8 Add init_qkv and dump_output example parameters for easier debugging 2025-05-29 01:05:49 +00:00
Qianfeng Zhang
10c35125d2 Add example parameter max_seqlen and max_target 2025-05-27 14:43:43 +00:00
Qianfeng Zhang
c9e19351c7 Update to the method for calculating max_seqlen in the example 2025-05-27 10:38:52 +00:00
Qianfeng Zhang
dc0977faad Use NRepetitions2DEpilogue for outputting o_acc tile 2025-05-26 14:43:30 +00:00
Qianfeng Zhang
81f7b139e0 Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cachelines 2025-05-22 09:49:11 +00:00
Qianfeng Zhang
a1346aaf3e Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm 2025-05-20 07:53:25 +00:00
Qianfeng Zhang
0a8ea6bd02 Adjust the threshold values for fp16/bf16 in the example 2025-05-20 07:52:37 +00:00
Qianfeng Zhang
29cf1610f1 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt 2025-05-20 07:52:01 +00:00
Qianfeng Zhang
fac03abbda Change do-while main-loop to while-do and remove early exiting check 2025-05-19 15:39:31 +00:00
Qianfeng Zhang
14ab6f154d Adjust the codes before the main-loop 2025-05-19 13:59:57 +00:00
Qianfeng Zhang
f411d676f2 Move k_tile loading and v_tile loading earlier in the loop 2025-05-19 13:59:24 +00:00
Qianfeng Zhang
902b1c645c Move k_tile loading in the loop earlier 2025-05-19 13:58:43 +00:00
Qianfeng Zhang
f582c21418 Replace s_acc and pcomp tile array by single tile object for simplification 2025-05-19 13:57:11 +00:00
Qianfeng Zhang
4e65469fe8 Add __builtin_amdgcn_sched_barrier(0) for instructing the compiler for better codes isolation 2025-05-18 16:20:47 +00:00
Qianfeng Zhang
e4e70f8b0a Set the block_per_cu to 3 for hdim-128 2025-05-18 15:58:57 +00:00
Qianfeng Zhang
ff3415d97d Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 2025-05-18 15:26:31 +00:00
Qianfeng Zhang
694295a9d3 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:43 +00:00
Qianfeng Zhang
afd7793e92 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:05 +00:00
Qianfeng Zhang
7c0ac51b4b Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 2025-05-18 15:24:17 +00:00
Qianfeng Zhang
473fbc374b Rename the hacked block_gemm_areg_bsmem_creg_v2 2025-05-18 15:23:40 +00:00
Qianfeng Zhang
58e45ec53a Move the lambda for dividing by max_seqlen from kernel to pipeline 2025-05-18 07:58:52 +00:00
Qianfeng Zhang
0771390a28 Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop 2025-05-18 03:41:31 +00:00
Qianfeng Zhang
586968785a Set example option -save_mask default to 0 2025-05-14 13:56:54 +00:00
Qianfeng Zhang
b0d3704390 Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask 2025-05-14 02:00:22 +00:00
Qianfeng Zhang
5b0a2618fd Add -save_mask option to the example to output int8 mask tensor 2025-05-14 01:54:45 +00:00
Qianfeng Zhang
c3761c3bd6 Update the rules of hstu masking 2025-05-13 10:37:19 +00:00
Qianfeng Zhang
3a320bcdf6 Add test cases for better functional verification 2025-05-10 16:04:06 +00:00
Qianfeng Zhang
79cd1f0653 Fix sequence dim length for o_dram descriptor in the kernel 2025-05-10 16:02:52 +00:00
Qianfeng Zhang
1d1dd8f1eb Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time" (This reverts commit 2972de4c88.) 2025-05-07 13:37:15 +00:00
Qianfeng Zhang
d32851e15c Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack 2025-05-07 10:09:23 +00:00
Qianfeng Zhang
632fd06a7a Use kK1=16 2025-05-07 09:51:16 +00:00
Qianfeng Zhang
079f7e3a03 Use type_convert rather than static_cast in f_silu 2025-05-07 07:11:54 +00:00
Qianfeng Zhang
72d55d1b40 Add max_seqlen as divider in siLu 2025-05-06 16:18:52 +00:00
Qianfeng Zhang
374e0626e6 Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention 2025-05-06 08:24:03 +00:00