composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	798fd3cd8b	Enable the kernel dispatching path from is_training & use_softmax to kStoreLSE	2026-06-05 15:51:39 +00:00
Qianfeng Zhang	8b62d651a4	Add instances and kStoreLSE template in dispatcher class to support outputting lse for fwd training	2026-06-05 15:51:39 +00:00
Qianfeng Zhang	c1e3b9be6a	Set lse tensor dim strides in example	2026-06-05 15:51:39 +00:00
Qianfeng Zhang	0fdca0e940	Replace template kUseSoftmax/kStoreLSE by boolean parameters in reference fwd codes to save compiling time	2026-06-05 15:51:39 +00:00
Qianfeng Zhang	8f091efc4c	Add support for outputing lse in the example and reference hstu attention forward implementation	2026-06-05 15:51:39 +00:00
Qianfeng Zhang	cc184fc202	Add support for preparing lse_dram_window in hstu fwd kernel	2026-06-05 15:51:39 +00:00
Qianfeng Zhang	75a3b5aab0	Kernel use types declared in the problem rather than the pipeline	2026-06-05 15:51:39 +00:00
Qianfeng Zhang	696d4534b2	Tiny simplification with defining the Bias related Kargs	2026-06-03 09:55:04 +00:00
Qianfeng Zhang	eba3c2f635	Add parameters used by storing lse in the fwd and fwd_splitkv_combine kernel to prepare for supporting training	2026-06-03 09:55:04 +00:00
Qianfeng Zhang	5ee8a37cd3	Move num_splits/o_acc_ptr/l_acc_ptr out from HstuAttention<xxx>FwdParams struct	2026-06-03 09:55:04 +00:00
Qianfeng Zhang	36dd77fb16	Add kStoreLSE template parameter to the problems	2026-06-03 09:55:04 +00:00
Qianfeng Zhang	333abddbae	Rename the reference interfaces and the files	2026-05-28 08:07:54 +00:00
Qianfeng Zhang	e841981ddd	Update to MakeLSEaccDramTileDistribution trying to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() work on less KThreadPerWarp	2026-05-23 07:24:00 +00:00
Qianfeng Zhang	1dbd127d1b	Use buffer_view to create lse_acc_dram_naive so that out_of_boundary loading value can be specified (be -inf)	2026-05-23 04:37:52 +00:00
Qianfeng Zhang	86d8d72008	Use partition_index parameter for all get_x_indices_from_distributed_indices() calls	2026-05-22 15:08:30 +00:00
Qianfeng Zhang	65992be728	Update to the cross_attention test/bench scripts	2026-05-22 14:44:14 +00:00
Qianfeng Zhang	b1052e87e1	Add implementation of hstu fwd splitkv for softmax path	2026-05-22 09:54:27 +00:00
Qianfeng Zhang	8a7529177d	Fix the calling context for type_context in scale_tile_in_scalar()/scale_tile_in_pack	2026-05-11 08:53:07 +00:00
Qianfeng Zhang	0a32eddc0a	Re-format the .hpp/.cpp files using clang-format-18	2026-05-10 13:46:02 +00:00
Qianfeng Zhang	6981f148ee	Fix potential bug in kernel host interface BlockSize()	2026-05-08 06:27:19 -04:00
Qianfeng Zhang	250f325c3a	More consideration in MakeOaccDramTileDistribution() in splitkv_combine pipeline policy	2026-05-07 06:32:01 -04:00
Qianfeng Zhang	888b6cad86	Use inline-assembly based v_pk_mul_f32 to scale tile pcomp_tile in non-softmax pipeline on gfx950	2026-04-30 14:06:06 +00:00
Qianfeng Zhang	4c583f0574	Add -fno-slp-vectorize option for building hstu kernels on gfx950	2026-04-30 13:37:22 +00:00
Qianfeng Zhang	7883f52d9f	Use include <...> format to refer to header files from ck_tile	2026-04-27 10:15:14 +00:00
Qianfeng Zhang	d0803f263d	Mark low probability branch as unlikely in the softmax pipelines	2026-04-27 07:24:03 +00:00
Qianfeng Zhang	b9d4be0982	Use type_convert to convert float constant to CompDataType	2026-04-24 15:46:26 +00:00
Qianfeng Zhang	1f2e2a272e	Implement conditional softmax rescale in trload with_softmax pipeline	2026-04-24 09:54:12 +00:00
Qianfeng Zhang	90e718f73d	Implement conditional softmax rescale in non-trload with_softmax pipeline	2026-04-24 09:22:44 +00:00
Qianfeng Zhang	d099819657	Renaming the test_hstu_attention_seqlen_kv.sh to test_hstu_cross_attention.sh	2026-04-24 07:36:41 +00:00
Qianfeng Zhang	0b6bbe45d6	Remove exposing kUseTrLoad as template parameter of pipeline problem	2026-04-21 15:35:03 +00:00
Qianfeng Zhang	8f0f7ca436	Simplification in the cross_attention testing/benchmarking scripts	2026-04-17 09:38:41 +00:00
Qianfeng Zhang	3f9f2fa736	Remove max_target 3200 cases from cross_attention testing and benchmarking	2026-04-17 09:17:38 +00:00
Qianfeng Zhang	db3263469c	Clarify the using the max_seqlen and max_seqlen_q	2026-04-17 09:13:45 +00:00
Qianfeng Zhang	5c84f54fd9	Add scripts for testing/benchmarking cross_attention cases	2026-04-16 15:45:57 +00:00
Qianfeng Zhang	7889844d6b	Clarify the using of group_max_seqlens[] and group_input_max_uih_seqlens[] parameters for group attention example	2026-04-16 07:11:55 +00:00
Qianfeng Zhang	9279af33f1	Add implementation of fwd splitkv on no_softmax path	2026-04-16 07:11:06 +00:00
Qianfeng Zhang	a95f64601d	Remove dropout=true instances to reduce compiling-time	2026-04-07 09:38:18 +00:00
Qianfeng Zhang	348c3e05be	Rename default_policy to policy for hstu_attention forward	2026-04-07 08:41:58 +00:00
Qianfeng Zhang	423cc72bc4	Move the calling of mask.GetTileRangeAlongX() to the kernel	2026-03-28 14:29:17 +00:00
Qianfeng Zhang	eefe426ef7	Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()	2026-03-26 14:40:01 +00:00
Qianfeng Zhang	76da618c85	Enable run-time selection of MTile sizes according to the predicted CU utilization ratio	2026-03-20 05:36:44 +00:00
Qianfeng Zhang	302537c5a8	Update to support grouped mode hstu attention	2026-03-09 16:15:58 +00:00
Qianfeng Zhang	73d6e0eb67	Using in-place version of block_tile_reduce() so that using of m_local is avoided	2026-03-05 16:27:41 +00:00
Qianfeng Zhang	2be2c3cd11	Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()	2026-02-21 14:59:00 +00:00
Qianfeng Zhang	f2a555dac7	Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts	2026-02-09 15:55:13 +00:00
Qianfeng Zhang	6f8b9548b5	Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument	2026-02-09 09:02:17 +00:00
Qianfeng Zhang	bdfa0a74c2	Update to hstu masking to separate the implementation for cross-attention and self-attention	2026-02-08 16:00:47 +00:00
Qianfeng Zhang	0711f4f90a	Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention	2026-02-06 15:40:07 +00:00
Qianfeng Zhang	d169ed2194	Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950	2026-02-05 15:57:18 +00:00
Qianfeng Zhang	8af5e26717	Add softmax selection to two of the testing scripts	2026-02-05 15:27:15 +00:00

1113 Commits