composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	ee5bd0ebba	Tiny simplification with defining the Bias related Kargs	2026-06-23 09:28:00 +00:00
Qianfeng Zhang	f41b0176d3	Add parameters used by storing lse in the fwd and fwd_splitkv_combine kernel to prepare for supporting training	2026-06-23 09:28:00 +00:00
Qianfeng Zhang	270d073c88	Move num_splits/o_acc_ptr/l_acc_ptr out from HstuAttention<xxx>FwdParams struct	2026-06-23 09:28:00 +00:00
Qianfeng Zhang	b2561b88e4	Add kStoreLSE template parameter to the problems	2026-06-23 09:28:00 +00:00
Qianfeng Zhang	2a86bfb6f5	Implement host reference operator for hstu attention backward	2026-06-23 09:28:00 +00:00
Qianfeng Zhang	ec58f92f05	Rename the reference interfaces and the files	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	7d317adf37	Update to MakeLSEaccDramTileDistribution trying to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() work on less KThreadPerWarp	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	30b5d7bd01	Use buffer_view to create lse_acc_dram_naive so that out_of_boundary loading value can be specified (be -inf)	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	9a7cc5b4a3	Use partition_index parameter for all get_x_indices_from_distributed_indices() calls	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	9fbe96ab76	Update to the cross_attention test/bench scripts	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	24329f15d1	Add implementation of hstu fwd splitkv for softmax path	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	d2bc927242	Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	ea5af27b62	Re-format the .hpp/.cpp files using clang-format-18	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	bd8a87301b	Fix potential bug in kernel host interface BlockSize()	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	f73d7d2f8a	More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	e48bcff488	Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	c6dfe030d0	Add -fno-slp-vectorize option for building hstu kernels on gfx950	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	1f4319ce91	Use include <...> format to refer to header files from ck_tile	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	a97c7a75ce	Mark low probability branch as unlikely in the softmax pipelines	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	68bbcac775	Use type_convert to convert float constant to CompDataType	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	d243b275da	Implement conditional softmax rescale in trload with_softmax pipeline	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	671c65e178	Implement conditional softmax rescale in non-trload with_softmax pipeline	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	cfbd8a342a	Renaming the test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	62bf2296c6	Remove exposing kUseTrLoad as template parameter of pipeline problem	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	67f9461b42	Simplification in the cross_attention testing/benchmarking scripts	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	bd46155431	Remove max_target 3200 cases from cross_attention testing and benchmarking	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	f99ed6225b	Clarify the using the max_seqlen and max_seqlen_q	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	6f2a73b17d	Add scripts for testing/benchmarking cross_attention cases	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	8a7f0a8e99	Clarify the using of group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for group attention example	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	c0922a6cb8	Add implementation of fwd splitkv on no_softmax path	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	410f472a33	Remove dropout=true instances to reduce compiling-time	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	686125c0cd	Rename default_policy to policy for hstu_attention forward	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	27d01448d0	Move the calling of mask.GetTileRangeAlongX() to the kernel	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	f99d6c4112	Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	ac9a142e63	Enable run-time selection of MTile sizes according to the predicted CU utilization ratio	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	0462d44215	Update to support grouped mode hstu attention	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	4781582b0f	Using in-place version of block_tile_reduce() so that using of m_local is avoided	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	d206045c53	Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	57d837977b	Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	09656528a6	Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	d55d6a19a7	Update to hstu masking to separate the implementation for cross-attention and self-attention	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	3b674ee8c9	Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	d8f7e5a791	Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	7d110e3872	Add softmax selection to two of the testing scripts	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	1e199d0641	[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	24d6e49323	Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	ea9e4d8e00	Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	cf971db352	Fix in K-LdsBuffer and V-LdsBuffer over-lap checking	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	bcedecce4f	Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop	2026-06-23 09:27:59 +00:00
Qianfeng Zhang	500f2245f7	Tiny fix in using v_tiles[] index	2026-06-23 09:27:59 +00:00

1331 Commits