Commit Graph

  • 13fdb382b2 Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile Qianfeng Zhang 2025-12-16 14:31:11 +00:00
  • d7ddc76542 Rename WarpTile in fwd setting Qianfeng Zhang 2025-12-14 16:21:54 +00:00
  • 7d9221a605 Simplifying the codes in defining KDram and QDram tile distribution Qianfeng Zhang 2025-12-14 13:50:49 +00:00
  • d8f0862ff8 Tiny update in GetMaxVectorSize() Qianfeng Zhang 2025-12-14 04:26:30 +00:00
  • d0fab4c34c Rename and add scripts for testing hdim96 Qianfeng Zhang 2025-12-12 15:23:01 +00:00
  • cb6fef75ca Enable hdim96 instances Qianfeng Zhang 2025-12-12 14:54:11 +00:00
  • 69c7921dfa Fix with regard to define stride in MakeKLdsBlockDescriptor() Qianfeng Zhang 2025-12-12 09:27:17 +00:00
  • 8590f4d71c Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV Qianfeng Zhang 2025-12-11 10:47:54 +00:00
  • 346c667470 Further correction with regard to using n0_loops and k1_loops Qianfeng Zhang 2025-12-08 15:00:21 +00:00
  • 8120a86ce6 Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning Qianfeng Zhang 2025-12-08 10:47:04 +00:00
  • 84949d4812 Simplify the codes in block_gemm Qianfeng Zhang 2025-12-06 14:59:49 +00:00
  • 9e252c1ab7 Further clarification in using kSubQKHeaddim and kQKHeaddim Qianfeng Zhang 2025-12-03 09:28:15 +00:00
  • b5178551de Clarify the using of kSubQKHeaddim and kQKHeaddim Qianfeng Zhang 2025-12-03 08:18:13 +00:00
  • e4e22cb2d9 Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines Qianfeng Zhang 2025-12-01 14:34:53 +00:00
  • 6ac36b9459 Tiny fix in GetQKBlockGemm Qianfeng Zhang 2025-11-30 14:04:48 +00:00
  • ff54459c23 Enable the using of WarpTile-32x32x16 and add scripts to verify Qianfeng Zhang 2025-11-29 16:18:31 +00:00
  • c4fc7b28c8 Add static_assert and comments in the with_softmax pipelines Qianfeng Zhang 2025-11-28 14:49:33 +00:00
  • c98688d5ad Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950 Qianfeng Zhang 2025-11-28 13:45:20 +00:00
  • fa5b077b91 Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline Qianfeng Zhang 2025-11-27 15:30:53 +00:00
  • 311fb8a379 Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md Qianfeng Zhang 2025-11-27 08:04:47 +00:00
  • cf2172b580 Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window Qianfeng Zhang 2025-11-22 16:12:21 +00:00
  • 45590fb8b1 Remove useless codes in the two trload pipelines Qianfeng Zhang 2025-11-15 13:48:00 +00:00
  • ec9d1fe253 Separate Traits from Problem while being used for defining the pipeline Qianfeng Zhang 2025-11-14 16:08:14 +00:00
  • 7655ffd2fb Remove the k_element_func and v_element_func from the pipeline since they are not used Qianfeng Zhang 2025-11-13 08:43:05 +00:00
  • 2ebe159050 Update to the two trload pipeline to load whole Q-tile once through LDS on mi350 Qianfeng Zhang 2025-11-12 15:59:38 +00:00
  • 0253442ef9 Simplify the codes in block_gemm_areg_bsmem_creg_v2_hack_1 Qianfeng Zhang 2025-11-10 15:52:12 +00:00
  • 0e1444a73c Simplify the codes in block_gemm_areg_bsmem_trload_creg Qianfeng Zhang 2025-11-10 15:27:34 +00:00
  • 76b515e092 [Performance] Change the tile settings for mi350/trload no_softmax pipeline to enable to use mfma-16x16x32 for Gemm-1 Qianfeng Zhang 2025-11-06 08:20:11 +00:00
  • f608f29401 Improve the softmax+trload pipeline by using kN0=64 and prefetch only two k tiles Qianfeng Zhang 2025-11-05 16:23:05 +00:00
  • 7d9a95dbe0 Tiny fix in trload with_softmax/no_softmax pipeline Qianfeng Zhang 2025-11-05 14:18:37 +00:00
  • e7bfee6043 Improve both the with_softmax and no_softmax pipelines Qianfeng Zhang 2025-11-04 15:18:58 +00:00
  • 3e122a551d Add kUseTrLoad = false in non-trload pipeline Qianfeng Zhang 2025-11-03 12:40:16 +00:00
  • 660604635d Clarifying the using of CK_TILE_HOST and CK_TILE_HOST_DEVICE trying to save compiling time Qianfeng Zhang 2025-11-03 08:39:43 +00:00
  • 3565656005 Change in updating max_uih_seqlen in the example Qianfeng Zhang 2025-11-02 03:54:51 +00:00
  • 639332b5de Use supplement_array_by_last_element(num_targets, ) in example Qianfeng Zhang 2025-11-02 03:30:13 +00:00
  • 19a887cffa Use supplement_array_by_last_element() in example to simplify the codes Qianfeng Zhang 2025-11-01 16:20:38 +00:00
  • 9537b958ab Update to README.md Qianfeng Zhang 2025-11-01 13:16:50 +00:00
  • 2ae759c5b3 Add scripts for testing the using of separate sequence lengths for k/v Qianfeng Zhang 2025-11-01 13:12:36 +00:00
  • a926db031a Support separate sequence lengths for q and kv Qianfeng Zhang 2025-10-31 14:04:32 +00:00
  • bcefc197d8 Use separate pipelines for using or not-using softmax situations Qianfeng Zhang 2025-10-30 08:01:22 +00:00
  • 4b3901e989 Implementation of hstu attention pipeline using trload for v on mi350 Qianfeng Zhang 2025-10-27 14:54:36 +00:00
  • 5db875df2c Fix in the comments Qianfeng Zhang 2025-10-27 10:36:15 +00:00
  • 563229a9e4 Update to gemm_0's CBlockDistribution encoding so that it is compatible with gemm_1's ABlockDistribution encoding Qianfeng Zhang 2025-10-27 10:34:45 +00:00
  • 4e9d4f7487 Using separate tile settings for no-softmax and with-softmax hstu attention situations Qianfeng Zhang 2025-10-23 14:45:28 +00:00
  • 91afdfee40 Update to benchmark scripts to consider for using softmax Qianfeng Zhang 2025-10-23 10:02:22 +00:00
  • be3b27edd3 Add support of softmax in hstu attention Qianfeng Zhang 2025-10-16 16:02:45 +00:00
  • ec4f174ac4 Add template parameter to gemm_0 MakeCBlockTile() for the need of defining PcompBlockTileType Qianfeng Zhang 2025-10-16 15:41:26 +00:00
  • da50eea674 Move scaling by attn_scale to inside the main-loop Qianfeng Zhang 2025-10-15 09:24:44 +00:00
  • 3639fb8e38 Let IsTokenPairInsideMask() return bool type Qianfeng Zhang 2025-10-15 08:50:48 +00:00
  • a71c996049 Add instances to consider for adding softmax support Qianfeng Zhang 2025-10-14 09:40:23 +00:00
  • 0abb52004a Remove K0 from tile setting since it is not used Qianfeng Zhang 2025-10-13 16:01:50 +00:00
  • de198549ad Change to pipeline so that it is easier to add support of using softmax Qianfeng Zhang 2025-10-11 10:11:35 +00:00
  • 7e79736df7 Remove using IGLP method for instruction scheduling for kUseLocal true path Qianfeng Zhang 2025-10-11 06:38:32 +00:00
  • 08f50c2c51 Fix in GetQKBlockGemm() Qianfeng Zhang 2025-09-27 14:31:24 +00:00
  • 5199dbd027 Simplify the warp_gemm definitions in GetQKBlockGemm and GetKVBlockGemm Qianfeng Zhang 2025-09-25 15:38:55 +00:00
  • b0710e8871 Remove useless constant statement in the kernel Qianfeng Zhang 2025-09-19 07:17:39 +00:00
  • 2c861d541f Remove un-necessary HSTU_CHECK() callings Qianfeng Zhang 2025-09-13 16:34:02 +00:00
  • bc5616f1dc Add HSTU_CHECK() and use it in example codes Qianfeng Zhang 2025-09-13 16:27:23 +00:00
  • f7f90c539e Small update in reference hstu attention Qianfeng Zhang 2025-09-13 06:42:46 +00:00
  • 9c4d76d96b Detach HstuBlockMask from pipeline definition and construct the HstuBlockMask type in the kernel according to window_size Qianfeng Zhang 2025-09-12 08:37:56 +00:00
  • 8313b34543 Unify the license statements on all the source files Qianfeng Zhang 2025-09-11 10:09:37 +00:00
  • 2668bb3aee Remove using MakeKargsImpl() to simplify the hstu kernel Qianfeng Zhang 2025-09-10 15:24:20 +00:00
  • a8c62920bf Clarify the using of kSubQKHeaddim and kQKHeaddim so that less regular hdim (eg. 96, 160) can be efficiently supported Qianfeng Zhang 2025-09-09 09:51:36 +00:00
  • 7a948eee1d Use set_slice_tilie() to replace direct thread_buffer assignment Qianfeng Zhang 2025-09-09 08:59:58 +00:00
  • 260ef2fdf2 Use xor transform to implement Q/K Lds descriptor for kKpack == 8 cases Qianfeng Zhang 2025-08-21 13:59:32 +00:00
  • 56dab1298f Remove selectable VLayout for simplifying the codes since hdim is always fastest dimension Qianfeng Zhang 2025-08-20 08:35:51 +00:00
  • e1ebd780c1 Using separate settings for gfx942 and gfx950 Qianfeng Zhang 2025-07-25 07:11:54 +00:00
  • 40091814c6 Tiny change in pipeline BlockGemm definition to adapt to the latest merging with develop branch Qianfeng Zhang 2025-08-18 14:51:26 +00:00
  • ee06d0b4fc Add norm_dist parameter for hstu example to select either normal or uniform distribution to initialize data Qianfeng Zhang 2025-08-12 03:04:27 +00:00
  • da5db1773d Tiny fix in HstuBlockMaskWithLocal::GetTileRangeAlongX() Qianfeng Zhang 2025-08-12 01:58:10 +00:00
  • 446b62ad82 Adjust the atol and rtol and fix the check_err() using in example_hstu_attention.cpp Qianfeng Zhang 2025-08-12 01:33:06 +00:00
  • dde0729405 Add simple handling for max_atten_seqlen bigger than max_uih_len situations Qianfeng Zhang 2025-08-10 13:23:44 +00:00
  • c649b9b049 Tiny fix and comments in HstuBlockMaskWithLocal::IsFullTimeInsideMask() Qianfeng Zhang 2025-08-10 06:09:13 +00:00
  • de47bfe752 Update HstuBlockMaskWithLocal::GetTileRangeAlongX, add comments and test cases for causal == false Qianfeng Zhang 2025-08-10 04:22:21 +00:00
  • 4ad55eab4d Update to support min_full_attn_seqlen be bigger than max_uih_len Qianfeng Zhang 2025-08-08 09:25:25 +00:00
  • fa7dba2cd1 [ck_tile] Merge get_partition_index() and get_partition_index_v2() to get_partition_index() with bool_constant parameter Qianfeng Zhang 2025-08-08 06:20:38 +00:00
  • 2b0f3791a6 [ck_tile] Add get_partition_index_v2 which uses warp_id in vgpr and to be used by tile_windows on lds-based tensor_view Qianfeng Zhang 2025-08-06 14:40:38 +00:00
  • 0b54f1f43d Add attn_scale MakeKargs() parameter support and update in example, reference codes Qianfeng Zhang 2025-08-03 03:33:08 +00:00
  • 6364b641b8 Replace the integer max_seqlen by float scale_p as kernel/pipeline parameter Qianfeng Zhang 2025-08-01 07:49:57 +00:00
  • 6784c0be5a Use __builtin_amdgcn_sched_barrier(0x1) to prevent the compiler from unexpected codes arrangement Qianfeng Zhang 2025-08-01 06:30:22 +00:00
  • 492c724b14 Fix added case in test_hstu_attention.sh Qianfeng Zhang 2025-07-25 15:12:05 +00:00
  • d3ed6ac473 Update in GetTileRangeAlongX to consider for non-causal+local_size>0 situation and add test case to test_hstu_attention.sh Qianfeng Zhang 2025-07-25 14:53:39 +00:00
  • 7d698c2b78 Add three scripts for verification of jagged causal cases Qianfeng Zhang 2025-07-25 11:18:32 +00:00
  • 14c955aade Fix in GetTileRangeAlongX() and IsFullTileInsideMask() of HstuBlockMaskWithLocal Qianfeng Zhang 2025-07-25 11:16:54 +00:00
  • 27019a61a0 Adjust the codes related to calculate i_m0 in the kernel Qianfeng Zhang 2025-07-23 13:23:11 +00:00
  • 056166bbeb [Performance] Use separate workgroups to handle seqlen scope [max_uih_len - minfull_attn_seqlen, seqlen] Qianfeng Zhang 2025-07-23 09:39:50 +00:00
  • b57939ff64 Fix comments in test_pytorch_hstu_mask.py scripts Qianfeng Zhang 2025-07-22 13:21:01 +00:00
  • 47c4a0c2ec Change the seqlen_q dim padding setting for o_dram and bias_dram Qianfeng Zhang 2025-07-22 13:13:45 +00:00
  • 7efc0e226a Correct some comments Qianfeng Zhang 2025-07-21 09:04:49 +00:00
  • 5b295efe1e Re-arrange the codes section for using sched_group_barrier Qianfeng Zhang 2025-07-21 08:16:22 +00:00
  • 34edc4391c Fix in using sched_group_barrier() Qianfeng Zhang 2025-07-21 07:34:54 +00:00
  • 1caef1fb89 Move store_tile() called before the current iteration Qianfeng Zhang 2025-07-21 04:17:10 +00:00
  • 45a189d73d Revert "Disable support of hdim64 and hdim256 for quick compiling and testing" Qianfeng Zhang 2025-07-17 09:00:41 +00:00
  • 09aa41ba9c Disable support of hdim64 and hdim256 for quick compiling and testing Qianfeng Zhang 2025-07-17 08:59:32 +00:00
  • 140af31e86 Fix bug in generate_instances.py and re-generate the instances Qianfeng Zhang 2025-07-17 08:37:09 +00:00
  • 71b0641d75 Re-org the kernel parameters in HstuAttentionFwdBatchModeBaseKargs and HstuAttentionFwdJaggModeBaseKargs Qianfeng Zhang 2025-07-17 04:48:55 +00:00
  • d131327aff Remove num_target from HstuBlockMask class member since it overlaps the meaning of max_uih_len Qianfeng Zhang 2025-07-15 16:19:06 +00:00
  • 19542dd99e Fix the calculation of number of instructions used by sched_group_barrier Qianfeng Zhang 2025-07-15 08:59:48 +00:00
  • cad7c6b2af [Performance] use iglp compiler instruction to tune the codes around gemm0 for window_size > 0 situation Qianfeng Zhang 2025-07-14 16:01:30 +00:00
  • 28f08b6f38 Add including of block_dropout.hpp in the hstu kernel to avoid potential compiling failure Qianfeng Zhang 2025-07-11 00:15:53 +00:00