Commit Graph

  • 2089713f94 [rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5) develop Kiefer van Teutem 2026-06-26 12:00:58 +00:00
  • 621697af8c [rocm-libraries] ROCm/rocm-libraries#8723 (commit e2f28c1) Illia Silin 2026-06-25 19:06:13 +00:00
  • 713f1fbf46 [rocm-libraries] ROCm/rocm-libraries#8739 (commit 833c182) users/andriy/ck/1464-pytorch-gfx1250 spolifroni-amd 2026-06-24 18:32:34 +00:00
  • e503e6277a [rocm-libraries] ROCm/rocm-libraries#8762 (commit fe88750) chris-tsiaousis-hpc 2026-06-24 16:41:50 +00:00
  • 137f2a9a10 [rocm-libraries] ROCm/rocm-libraries#7407 (commit 0b79e05) Kiefer van Teutem 2026-06-24 13:35:25 +00:00
  • 563f970e10 Support OAI SwiGLU in MoE epilogue lirui/vllm_atom_m3_0624 lirui927 2026-06-24 17:34:07 +08:00
  • bdc300591b Removing the class derivation to simplify struct HstuAttentionFwdCommonDropoutKargs hstu_attention_fwd Qianfeng Zhang 2026-06-24 09:03:39 +00:00
  • d5c872f504 Removing the class derivation to simplify struct HstuAttentionFwdCommonDropoutKargs hstu_attention_fwd_bwd Qianfeng Zhang 2026-06-24 09:03:39 +00:00
  • 43f37a22a1 CK-UA: XCD-balanced decode grid swizzle + gated decode interleave jukorhon/fa4-k-preread juuso-oskari 2026-06-24 08:27:34 +00:00
  • bd3713c710 [rocm-libraries] ROCm/rocm-libraries#8716 (commit 8230b20) Illia Silin 2026-06-23 18:12:18 +00:00
  • 9073fa6235 WIP: fix assembly test_async_v3 Enrico Degregori 2026-06-23 16:32:25 +00:00
  • 8207ddfcd1 Fix offset_ptrs_by_tile_coords being always true aiter_cktile_integration_a8w8_a4w4 Anton Gorenko 2026-06-23 15:51:47 +05:00
  • 0720bc48be Fix using multiplies Qianfeng Zhang 2026-06-23 10:20:17 +00:00
  • ad769baea4 Fix make_kernel() template parameters Qianfeng Zhang 2026-06-23 09:38:37 +00:00
  • 0b0684aff2 Revert "[ck_tile] Add get_partition_index_v2 which uses warp_id in vgpr and to be used by tile_windows on lds-based tensor_view" Qianfeng Zhang 2026-06-23 09:31:31 +00:00
  • 1805c985a0 Update to example_hstu_attention_fwd.cpp Qianfeng Zhang 2026-06-22 09:22:10 +00:00
  • dc1d433351 Rename generate_instances.py to generate_fwd_instances.py Qianfeng Zhang 2026-06-22 08:01:23 +00:00
  • 503a017c42 Remove the using of kSubQKHeaddim Qianfeng Zhang 2026-06-19 05:16:40 +00:00
  • a81f32331f Add restriction on the relationship between HstuAttention<xxx>Problem and HstuAttention<xxx>TileSettingClass Qianfeng Zhang 2026-06-17 15:58:25 +00:00
  • 79ebc7479d Add static_assert() in HstuAttentionFwdTileSettingClass Qianfeng Zhang 2026-06-14 13:36:46 +00:00
  • 7a67ae4dd3 Update to the comments in reference_hstu_attention_bwd.hpp Qianfeng Zhang 2026-06-12 15:31:16 +00:00
  • 7d2e575fed Fix the comments in reference_hstu_attention_fwd.hpp Qianfeng Zhang 2026-06-12 14:50:22 +00:00
  • d07b37c097 Remove un-used element-wise functions passed through pipelines' operator() interfaces Qianfeng Zhang 2026-06-12 08:40:18 +00:00
  • cc7e216fa6 Rename GetKVBlockGemm to GetPVTBlockGemm Qianfeng Zhang 2026-06-12 07:35:53 +00:00
  • 06547efd90 Remove the kHasBias==true instances to save building time Qianfeng Zhang 2026-06-12 05:31:09 +00:00
  • f41777fbd2 Renaming BUILD_HSTU_FOR_GFX95_ONLY to BUILD_HSTU_FOR_GFX95 Qianfeng Zhang 2026-06-10 15:30:08 +00:00
  • 295136e48b Update the README.md according to the summary by claude code Qianfeng Zhang 2026-06-10 14:42:49 +00:00
  • a1ad9fc312 Fix the using of num_targets[] in run_group_hstu_attention Qianfeng Zhang 2026-06-10 08:57:41 +00:00
  • 62627db768 Update to the comments in reference_hstu_attention_bwd.hpp Qianfeng Zhang 2026-06-09 08:14:07 +00:00
  • 08873f0d50 Renaming in the dispatching codes and generate_instances.py scripts Qianfeng Zhang 2026-06-08 15:21:43 +00:00
  • c7de3af246 Split hstu_attention_util.hpp into host_util.hpp and kernel_util.hpp Qianfeng Zhang 2026-06-06 06:41:25 +00:00
  • 89d6f5aa92 Remove un-needed includings from some hpp and cpp files Qianfeng Zhang 2026-06-06 04:47:24 +00:00
  • 6f4f3eac48 Tiny update in generate_instances.py Qianfeng Zhang 2026-06-06 04:46:03 +00:00
  • 673207ce59 Fix header file mapping bug in generate_instances.py Qianfeng Zhang 2026-06-06 03:44:35 +00:00
  • f73341de37 Some renaming in kernel and pipeline Qianfeng Zhang 2026-06-05 15:43:47 +00:00
  • 42a3bfbab7 Update and fix for leaked changes and make the scripts be able to test/benchmark kStoreLSE cases Qianfeng Zhang 2026-06-05 10:33:32 +00:00
  • b17b41a1e6 Enable the kernel dispatching path from is_training & use_softmax to kStoreLSE Qianfeng Zhang 2026-06-05 06:46:07 +00:00
  • 4414019296 Add instances and kStoreLSE template in dispatcher class to support outputting lse for fwd training Qianfeng Zhang 2026-06-05 01:06:11 +00:00
  • 8f83a2841f Set lse tensor dim strides in example Qianfeng Zhang 2026-06-05 00:32:39 +00:00
  • 02e5c23f9c Replace template kUseSoftmax/kStoreLSE by boolean parameters in reference fwd codes to save compiling time Qianfeng Zhang 2026-06-04 15:12:20 +00:00
  • bd126618d1 Add support for outputting lse in the example and reference hstu attention forward implementation Qianfeng Zhang 2026-06-04 14:51:16 +00:00
  • fa0a9c1656 Add support for preparing lse_dram_window in hstu fwd kernel Qianfeng Zhang 2026-06-04 09:58:13 +00:00
  • fd9af72c9f Kernel use types declared in the problem rather than the pipeline Qianfeng Zhang 2026-06-04 08:13:24 +00:00
  • ee5bd0ebba Tiny simplification with defining the Bias related Kargs Qianfeng Zhang 2026-06-03 09:43:44 +00:00
  • f41b0176d3 Add parameters used by storing lse in the fwd and fwd_splitkv_combine kernel to prepare for supporting training Qianfeng Zhang 2026-06-03 09:19:32 +00:00
  • 270d073c88 Move num_splits/o_acc_ptr/l_acc_ptr out from HstuAttention<xxx>FwdParams struct Qianfeng Zhang 2026-06-02 15:43:44 +00:00
  • b2561b88e4 Add kStoreLSE template parameter to the problems Qianfeng Zhang 2026-05-28 15:25:11 +00:00
  • 2a86bfb6f5 Implement host reference operator for hstu attention backward Qianfeng Zhang 2026-05-28 10:25:34 +00:00
  • ec58f92f05 Rename the reference interfaces and the files Qianfeng Zhang 2026-05-28 08:07:54 +00:00
  • 7d317adf37 Update to MakeLSEaccDramTileDistribution trying to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() work on less KThreadPerWarp Qianfeng Zhang 2026-05-23 07:22:29 +00:00
  • 30b5d7bd01 Use buffer_view to create lse_acc_dram_naive so that out_of_boundary loading value can be specified (be -inf) Qianfeng Zhang 2026-05-23 04:37:52 +00:00
  • 9a7cc5b4a3 Use partition_index parameter for all get_x_indices_from_distributed_indices() calls Qianfeng Zhang 2026-05-22 15:08:30 +00:00
  • 9fbe96ab76 Update to the cross_attention test/bench scripts Qianfeng Zhang 2026-05-22 14:44:14 +00:00
  • 24329f15d1 Add implementation of hstu fwd splitkv for softmax path Qianfeng Zhang 2026-05-20 16:14:25 +00:00
  • d2bc927242 Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack Qianfeng Zhang 2026-05-11 08:53:07 +00:00
  • ea5af27b62 Re-format the .hpp/.cpp files using clang-format-18 Qianfeng Zhang 2026-05-10 13:46:02 +00:00
  • bd8a87301b Fix potential bug in kernel host interface BlockSize() Qianfeng Zhang 2026-05-08 06:27:19 -04:00
  • f73d7d2f8a More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy Qianfeng Zhang 2026-05-07 06:32:01 -04:00
  • e48bcff488 Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950 Qianfeng Zhang 2026-04-30 14:06:06 +00:00
  • c6dfe030d0 Add -fno-slp-vectorize option for building hstu kernels on gfx950 Qianfeng Zhang 2026-04-30 13:37:22 +00:00
  • 1f4319ce91 Use include <...> format to refer to header files from ck_tile Qianfeng Zhang 2026-04-27 10:15:14 +00:00
  • a97c7a75ce Mark low probability branch as unlikely in the softmax pipelines Qianfeng Zhang 2026-04-27 06:39:32 +00:00
  • 68bbcac775 Use type_convert to convert float constant to CompDataType Qianfeng Zhang 2026-04-24 15:46:26 +00:00
  • d243b275da Implement conditional softmax rescale in trload with_softmax pipeline Qianfeng Zhang 2026-04-24 09:54:12 +00:00
  • 671c65e178 Implement conditional softmax rescale in non-trload with_softmax pipeline Qianfeng Zhang 2026-04-24 09:22:44 +00:00
  • cfbd8a342a Renaming the test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh Qianfeng Zhang 2026-04-24 07:36:41 +00:00
  • 62bf2296c6 Remove exposing kUseTrLoad as template parameter of pipeline problem Qianfeng Zhang 2026-04-21 15:35:03 +00:00
  • 67f9461b42 Simplification in the cross_attention testing/benchmarking scripts Qianfeng Zhang 2026-04-17 09:38:41 +00:00
  • bd46155431 Remove max_target 3200 cases from cross_attention testing and benchmarking Qianfeng Zhang 2026-04-17 09:17:38 +00:00
  • f99ed6225b Clarify the using the max_seqlen and max_seqlen_q Qianfeng Zhang 2026-04-17 09:13:45 +00:00
  • 6f2a73b17d Add scripts for testing/benchmarking cross_attention cases Qianfeng Zhang 2026-04-16 15:45:57 +00:00
  • 8a7f0a8e99 Clarify the using of group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for group attention example Qianfeng Zhang 2026-04-15 16:18:43 +00:00
  • c0922a6cb8 Add implementation of fwd splitkv on no_softmax path Qianfeng Zhang 2026-04-15 07:14:40 +00:00
  • 410f472a33 Remove dropout=true instances to reduce compiling-time Qianfeng Zhang 2026-04-07 09:38:18 +00:00
  • 686125c0cd Rename default_policy to policy for hstu_attention forward Qianfeng Zhang 2026-04-07 08:41:14 +00:00
  • 27d01448d0 Move the calling of mask.GetTileRangeAlongX() to the kernel Qianfeng Zhang 2026-03-28 14:19:22 +00:00
  • f99d6c4112 Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile() Qianfeng Zhang 2026-03-26 14:40:01 +00:00
  • ac9a142e63 Enable run-time selection of MTile sizes according to the predicted CU utilization ratio Qianfeng Zhang 2026-03-19 10:39:37 +00:00
  • 0462d44215 Update to support grouped mode hstu attention Qianfeng Zhang 2026-03-09 16:15:58 +00:00
  • 4781582b0f Using in-place version of block_tile_reduce() so that using of m_local is avoided Qianfeng Zhang 2026-03-05 16:27:41 +00:00
  • d206045c53 Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane() Qianfeng Zhang 2026-02-21 14:46:31 +00:00
  • 57d837977b Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts Qianfeng Zhang 2026-02-09 15:55:13 +00:00
  • 09656528a6 Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument Qianfeng Zhang 2026-02-09 09:02:17 +00:00
  • d55d6a19a7 Update to hstu masking to separate the implementation for cross-attention and self-attention Qianfeng Zhang 2026-02-08 08:06:47 +00:00
  • 3b674ee8c9 Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention Qianfeng Zhang 2026-02-06 15:40:07 +00:00
  • d8f7e5a791 Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950 Qianfeng Zhang 2026-02-05 15:57:18 +00:00
  • 7d110e3872 Add softmax selection to two of the testing scripts Qianfeng Zhang 2026-02-05 14:56:31 +00:00
  • 1e199d0641 [Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling Qianfeng Zhang 2026-02-02 15:59:38 +00:00
  • 24d6e49323 Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256 Qianfeng Zhang 2026-01-30 09:58:12 +00:00
  • ea9e4d8e00 Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q Qianfeng Zhang 2026-01-25 14:51:53 +00:00
  • cf971db352 Fix in K-LdsBuffer and V-LdsBuffer over-lap checking Qianfeng Zhang 2025-12-27 05:43:11 +00:00
  • bcedecce4f Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop Qianfeng Zhang 2025-12-26 15:13:28 +00:00
  • 500f2245f7 Tiny fix in using v_tiles[] index Qianfeng Zhang 2025-12-25 15:37:22 +00:00
  • f03750674a Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving Qianfeng Zhang 2025-12-25 14:32:44 +00:00
  • f0c4f1bc85 Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving Qianfeng Zhang 2025-12-25 14:30:57 +00:00
  • e448412a5a Load Q directly from global memory to registers for BlockGemm Qianfeng Zhang 2025-12-20 13:35:45 +00:00
  • 90988313a9 Remove un-used including from default policy file Qianfeng Zhang 2025-12-19 09:58:37 +00:00
  • c4b8663c00 Move common codes to detail namespace from Problem class scope Qianfeng Zhang 2025-12-17 10:27:20 +00:00
  • e80c99b672 Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f) Qianfeng Zhang 2025-12-17 07:35:19 +00:00
  • 97136bd3f7 Add support of loading QK tiles of hdim96 without padding to hdim128 Qianfeng Zhang 2025-12-14 04:20:05 +00:00