Commit Graph

  • ca36501613 Moving code-lines in hstu pipeline Qianfeng Zhang 2025-07-07 09:49:16 +00:00
  • d921b14082 Remove using i_loop and num_loops since seqlen_k_curr and seqlen_k_end is enough Qianfeng Zhang 2025-07-06 14:31:36 +00:00
  • ac5194f783 Let causal == 0 cases do IsFullTileInsideMask() checking before calling IsTokenPairInsideMask() Qianfeng Zhang 2025-06-26 10:23:03 +00:00
  • ff807ddd1a Tiny movement in the code lines of the pipeline Qianfeng Zhang 2025-06-26 10:09:51 +00:00
  • 818722a3c6 Use two work-groups per compute-unit for scheduling the kernel Qianfeng Zhang 2025-06-26 09:46:33 +00:00
  • 1a917993cf Simplify the codes in all host/device IsTokenPairInsideMask() trying to reduce branching Qianfeng Zhang 2025-06-23 14:13:55 +00:00
  • 57050d93a6 Fix masking for min_full_attn_seqlen > 0 situation Qianfeng Zhang 2025-06-22 16:23:57 +00:00
  • 508f4ac632 Update to test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py to align their testings Qianfeng Zhang 2025-06-22 15:20:47 +00:00
  • 49a12df133 Completely remove the dependency to include/ck_tile/ops/fmha/ops headers Qianfeng Zhang 2025-06-22 11:29:03 +00:00
  • f62c52a499 Fix in using KV LdsBuffers to avoid unexpected overwriting that causes non-deterministic results Qianfeng Zhang 2025-06-21 13:48:14 +00:00
  • 9cdd64f337 Change while() do to do while() for the main loop to let the compiler generate more elegant code Qianfeng Zhang 2025-06-21 12:58:27 +00:00
  • 3413400044 Use batch dim as first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN by ASSUME_HIGHLY_VARIED_SEQLEN Qianfeng Zhang 2025-06-18 15:57:04 +00:00
  • dcef0260ff Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks Qianfeng Zhang 2025-06-18 15:31:47 +00:00
  • 07635af84c Tiny fix in hstu attention IsFullTileInsideMask() root 2025-06-18 15:12:05 +00:00
  • e033a82bd7 Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control Qianfeng Zhang 2025-06-10 15:43:19 +00:00
  • d39ba52e07 Improve the VDramTileDistribution and VLds layout for better device loading and reduce bank-conflict Qianfeng Zhang 2025-06-08 11:22:21 +00:00
  • 5f16f5db20 Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility Qianfeng Zhang 2025-06-07 12:46:40 +00:00
  • 88a0838453 Add assert(contextual_seqlen >= 0) in example Qianfeng Zhang 2025-06-06 14:22:07 +00:00
  • 5944a63f11 Update IsFullTileInsideMask() for kUseLocal is true situation Qianfeng Zhang 2025-06-06 14:20:41 +00:00
  • cbc5485589 Move all test and bench scripts to folder scripts Qianfeng Zhang 2025-06-06 08:22:38 +00:00
  • e36446aca9 Add two scripts Qianfeng Zhang 2025-06-06 08:14:12 +00:00
  • a8738f4455 Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes Qianfeng Zhang 2025-05-30 16:02:45 +00:00
  • 3cc6f4abc8 Tune the settings for hdim-256 Qianfeng Zhang 2025-05-30 08:49:33 +00:00
  • fb5aa39762 Add example parameter alpha to ease the testing Qianfeng Zhang 2025-05-30 08:47:55 +00:00
  • e59ec37ffb Convert P to fp16/bf16 before doing second gemm in reference hstu implementation Qianfeng Zhang 2025-05-29 01:04:19 +00:00
  • c83d8587c7 not-critical updates in example and block_masking codes Qianfeng Zhang 2025-05-29 01:02:20 +00:00
  • 08e381d632 Add init_qkv and dump_output example parameters for easier debugging Qianfeng Zhang 2025-05-28 15:33:54 +00:00
  • 85bc8fd805 Add example parameter max_seqlen and max_target Qianfeng Zhang 2025-05-27 14:18:41 +00:00
  • 46301a85d9 Update to the method for calculating max_seqlen in the example Qianfeng Zhang 2025-05-27 10:36:43 +00:00
  • a079b95b77 Use NRepetitions2DEpilogue for outputting o_acc tile Qianfeng Zhang 2025-05-26 14:09:55 +00:00
  • eba3242ab8 Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cachelines Qianfeng Zhang 2025-05-21 16:44:39 +00:00
  • d74b41070f Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm Qianfeng Zhang 2025-05-20 07:50:56 +00:00
  • 4833daf43d Adjust the threshold values for fp16/bf16 in the example Qianfeng Zhang 2025-05-20 07:48:54 +00:00
  • 6e38888f46 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt Qianfeng Zhang 2025-05-20 07:46:21 +00:00
  • 2b94d9261c Change do-while main-loop to while-do and remove early exiting check Qianfeng Zhang 2025-05-19 15:38:17 +00:00
  • abc8335c43 Adjust the codes before the main-loop Qianfeng Zhang 2025-05-19 11:24:59 +00:00
  • 946e917e2c Move k_tile loading and v_tile loading earlier in the loop Qianfeng Zhang 2025-05-19 10:26:45 +00:00
  • 45ac659ae0 Move k_tile loading in the loop earlier Qianfeng Zhang 2025-05-19 10:16:09 +00:00
  • 109dcfc2f0 Replace s_acc and pcomp tile array by single tile object for simplification Qianfeng Zhang 2025-05-19 07:46:57 +00:00
  • 40056b95a9 Add __builtin_amdgcn_sched_barrier(0) for instructing the compiler for better codes isolation Qianfeng Zhang 2025-05-18 16:19:30 +00:00
  • 8c43b793c9 Set the block_per_cu to 3 for hdim-128 Qianfeng Zhang 2025-05-18 15:58:02 +00:00
  • 1bbefda240 Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 Qianfeng Zhang 2025-05-18 15:02:36 +00:00
  • 11718b0af4 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 Qianfeng Zhang 2025-05-18 14:02:17 +00:00
  • d01b4f27c6 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 Qianfeng Zhang 2025-05-18 13:40:38 +00:00
  • 4545d2efc1 Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 Qianfeng Zhang 2025-05-15 14:02:02 +00:00
  • 6e7553be77 Rename the hacked block_gemm_areg_bsmem_creg_v2 Qianfeng Zhang 2025-05-15 10:38:15 +00:00
  • e5977717a8 Move the lambda for dividing by max_seqlen from kernel to pipeline Qianfeng Zhang 2025-05-18 07:56:34 +00:00
  • 70237d2e5c Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop Qianfeng Zhang 2025-05-18 03:21:36 +00:00
  • 27e64a682a Set example option -save_mask default to 0 Qianfeng Zhang 2025-05-14 13:44:24 +00:00
  • 0ee9dff5cb Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask Qianfeng Zhang 2025-05-14 02:00:22 +00:00
  • 4ff88b4400 Add -save_mask option to the example to output int8 mask tensor Qianfeng Zhang 2025-05-14 01:54:45 +00:00
  • 124539e123 Update the rules of hstu masking Qianfeng Zhang 2025-05-13 10:37:19 +00:00
  • b2cd7757f0 Add test cases for better functional verification Qianfeng Zhang 2025-05-10 16:04:06 +00:00
  • 3d83d23a55 Fix sequence dim length for o_dram descriptor in the kernel Qianfeng Zhang 2025-05-10 16:02:52 +00:00
  • 010b3f48b3 Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time" Qianfeng Zhang 2025-05-07 13:37:15 +00:00
  • bce38c1531 Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack Qianfeng Zhang 2025-05-07 10:09:23 +00:00
  • 87b5aa78bd Use kK1=16 Qianfeng Zhang 2025-05-07 09:51:16 +00:00
  • bce88a9e73 Use type_convert rather than static_cast in f_silu Qianfeng Zhang 2025-05-07 07:05:43 +00:00
  • 9c3e49a1d0 Add max_seqlen as divider in siLu Qianfeng Zhang 2025-05-06 16:16:23 +00:00
  • 717aae7ce7 Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention Qianfeng Zhang 2025-05-06 08:24:03 +00:00
  • 380165c3dc Override and fix GetAlignmentK() Qianfeng Zhang 2025-05-03 16:17:28 +00:00
  • 34998cfd19 Use kN0=32 Qianfeng Zhang 2025-04-30 05:42:43 +00:00
  • 94f8d71ee2 Temporarily close the instance for hdim64 and hdim256 to save compiling time Qianfeng Zhang 2025-04-30 01:54:43 +00:00
  • 9df0fad750 Hack block_gemm_areg_bsmem_creg_v2 to let s_acc for gemm_0 not need be cleared first Qianfeng Zhang 2025-04-28 15:11:54 +00:00
  • 6bf4877a20 Adjust the v_tile and k_tile loading location Qianfeng Zhang 2025-04-28 09:25:09 +00:00
  • ba037426c5 Put two gemms call inside one n0loop unroll Qianfeng Zhang 2025-04-28 06:41:37 +00:00
  • 23852ef4c0 Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausal=true but kUseLocal=false Qianfeng Zhang 2025-04-27 09:31:38 +00:00
  • ddd9227453 Replace set_tile_if() by sweep_tile_span() to reduce branching Qianfeng Zhang 2025-04-27 05:00:09 +00:00
  • 187f4d3f68 Update the GridSize() and GetTileIndex() in hstu kernel Qianfeng Zhang 2025-04-26 10:01:23 +00:00
  • 9a08e1090e Add scripts for measuring jagged with/no causal cases Qianfeng Zhang 2025-04-25 15:59:51 +00:00
  • c0128a9156 Tiny update in IsTokenPairInsideMask() Qianfeng Zhang 2025-04-25 15:36:58 +00:00
  • ac0e593e0d Use compiler builtin directly in f_silu for float type Qianfeng Zhang 2025-04-25 15:04:41 +00:00
  • 31c21c74d8 Code re-arrangement in pipeline Qianfeng Zhang 2025-04-25 14:16:29 +00:00
  • eb2564fe46 Update the seqlen_k_curr inside the first gemm loop Qianfeng Zhang 2025-04-25 13:59:48 +00:00
  • 40683ee932 Rename the performance measurement scripts Qianfeng Zhang 2025-04-25 06:09:17 +00:00
  • 79fdd564b8 Add support for WarpGemm-16x16x32 in QK-BlockGemm (which enables using ds_write/read_b128 for K) Qianfeng Zhang 2025-04-25 06:06:50 +00:00
  • 1986d8c578 Update in K-Lds laying-out to consider for both WarpGemm-32x32x16 and WarpGemm-16x16x16 Qianfeng Zhang 2025-04-24 15:02:57 +00:00
  • 6dd83b2a5a Use 16x16x16 WarpGemm Qianfeng Zhang 2025-04-24 07:59:28 +00:00
  • 7153a99dd4 Using __builtin_amdgcn_rcpf in siLU function Qianfeng Zhang 2025-04-24 06:28:16 +00:00
  • fb89a013b7 Combine minus with scale_s Qianfeng Zhang 2025-04-24 05:47:24 +00:00
  • 23e80a5964 Move silu calculation to gemm1 iteration and try to interleave gemm_1 and silu Qianfeng Zhang 2025-04-23 13:10:02 +00:00
  • f40d68c1a9 Update in using masking for the case where kMasking is false and kPadSeqLenK is true Qianfeng Zhang 2025-04-23 10:47:27 +00:00
  • 95b9a277ac Fix in generate_instances.py and re-generated the instances Qianfeng Zhang 2025-04-23 04:00:21 +00:00
  • e5fb03a4aa Back to use exp() instead of exp2() since exp() in ck_tile using fast __builtin_amdgcn_exp2f() Qianfeng Zhang 2025-04-21 14:41:51 +00:00
  • 266e7bc8e9 Use kN0=64 to save vgprs Qianfeng Zhang 2025-04-22 14:45:27 +00:00
  • 8f7a97fe02 Fix the script name Qianfeng Zhang 2025-04-22 13:43:11 +00:00
  • bace12feac Fix in GetTileRangeAlongX Qianfeng Zhang 2025-04-22 13:42:03 +00:00
  • 7a7c17802a Add script compare_with_triton_2.sh for measuring the jagged cases of seqlen 1024/2048/4096/8192/16384/32768 Qianfeng Zhang 2025-04-22 10:16:20 +00:00
  • e731437af1 Change gemm0 to iterate along kN0 so that BlockGemm can overlap with masking and siLu Qianfeng Zhang 2025-04-19 15:52:51 +00:00
  • 8da21d9cde Fix the GetTileRangeAlongX() to align with the hstu masking definition when both causal=true and local=true Qianfeng Zhang 2025-04-18 15:37:49 +00:00
  • 065776d42d Remove un-needed __builtin_amdgcn_sched_barrier(0) Qianfeng Zhang 2025-04-18 10:05:57 +00:00
  • 63acd4638b Use shared ring Lds buffers for K/V to avoid over-lapping between first-K/last-V or last-K/first-V Qianfeng Zhang 2025-04-18 09:47:43 +00:00
  • 58090fe730 Tiny codes simplification in pipeline Qianfeng Zhang 2025-04-18 08:22:11 +00:00
  • ec14e9df3e Remove one line of __builtin_amdgcn_sched_barrier(0) Qianfeng Zhang 2025-04-17 14:21:14 +00:00
  • c0609d49cd Fix the integer overflow in total_flops calculation Qianfeng Zhang 2025-04-17 10:34:13 +00:00
  • 1efb2a8f38 Add scripts for comparing with triton Qianfeng Zhang 2025-04-17 10:33:44 +00:00
  • 70b4aa310f Use exp2() to calculate exp() for better performance Qianfeng Zhang 2025-04-16 06:54:06 +00:00
  • 45019fd5fd Remove the comparing of row/col to max_uih_len in masking Qianfeng Zhang 2025-04-16 04:35:42 +00:00
  • ad10a2dd53 Use kM0=128 kN0=64 to completely remove the vgprs spilling Qianfeng Zhang 2025-04-15 15:08:46 +00:00
  • 8b2948b31e Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations Qianfeng Zhang 2025-04-15 14:40:55 +00:00