composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-13 02:27:33 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	b75077475b	Remove useless codes in the two trload pipelines	2025-11-15 13:48:50 +00:00
Qianfeng Zhang	238b5c4f08	Separate Traits from Problem while being used for defining the pipeline	2025-11-14 16:42:06 +00:00
Qianfeng Zhang	95c1bb25e3	Remove the k_element_func and v_element_func from the pipeline since they are not used	2025-11-13 14:53:47 +00:00
Qianfeng Zhang	881ddc5741	Update to the two trload pipeline to load whole Q-tile once through LDS on mi350	2025-11-12 15:59:38 +00:00
Qianfeng Zhang	8f876f094e	Simplify the codes in block_gemm_areg_bsmem_creg_v2_hack_1	2025-11-10 16:11:26 +00:00
Qianfeng Zhang	303818a851	Simplify the codes in block_gemm_areg_bsmem_trload_creg	2025-11-10 15:27:34 +00:00
Qianfeng Zhang	bd0444f365	[Performance] Change the tile settings for mi350/trload no_softmax pipeline to enable to use mfma-16x16x32 for Gemm-1	2025-11-06 08:20:11 +00:00
Qianfeng Zhang	54cd431f16	Improve the softmax+trload pipeline by using kN0=64 and prefetch only two k tiles	2025-11-05 16:23:05 +00:00
Qianfeng Zhang	d190af2ef5	Tiny fix in trload with_softmax/no_softmax pipeline	2025-11-05 14:44:13 +00:00
Qianfeng Zhang	99993acca4	Improve both the with_softmax and no_softmax pipelines	2025-11-04 15:46:49 +00:00
Qianfeng Zhang	bc22b83b19	Add kUseTrLoad = false in non-trload pipeline	2025-11-03 12:40:16 +00:00
Qianfeng Zhang	e40ab20b9e	Clarifying the using of CK_TILE_HOST and CK_TILE_HOST_DEVICE trying to save compiling time	2025-11-03 09:40:03 +00:00
Qianfeng Zhang	e31829384d	Change in updating max_uih_seqlen in the example	2025-11-02 03:59:46 +00:00
Qianfeng Zhang	39cb8c33d1	Use supplement_array_by_last_element(num_targets, ) in example	2025-11-02 03:34:54 +00:00
Qianfeng Zhang	80e08b6efe	Use supplement_array_by_last_element() in example to simplify the codes	2025-11-02 03:17:16 +00:00
Qianfeng Zhang	10133e5d51	Update to README.md	2025-11-02 03:16:48 +00:00
Qianfeng Zhang	8408ec0a02	Add scripts for testing the using of separate sequence lengths for k/v	2025-11-02 03:16:22 +00:00
Qianfeng Zhang	17e404be3b	Support separate sequence lengths for q and kv	2025-11-02 03:14:53 +00:00
Qianfeng Zhang	eaf9650fed	Use separate pipelines for using or not-using softmax situations	2025-10-30 10:01:52 +00:00
Qianfeng Zhang	207e6f10b8	Implementation of hstu attention pipeline using trload for v on mi350	2025-10-29 15:45:14 +00:00
Qianfeng Zhang	a464269bb6	Fix in the comments	2025-10-27 10:47:40 +00:00
Qianfeng Zhang	4eeb5cc917	Update to gemm_0's CBlockDistribution encoding so that it is compatible with gemm_1's ABlockDistribution encoding	2025-10-27 10:47:23 +00:00
Qianfeng Zhang	98a241a2eb	Using separate tile settings for no-softmax and with-softmax hstu attention situations	2025-10-24 01:47:55 +00:00
Qianfeng Zhang	7c4012266a	Update to benchmark scripts to consider for using softmax	2025-10-23 10:09:37 +00:00
Qianfeng Zhang	d1505786f8	Add support of softmax in hstu attention	2025-10-20 14:26:55 +00:00
Qianfeng Zhang	a874839dc2	Add template parameter to gemm_0 MakeCBlockTile() for the need of defining PcompBlockTileType	2025-10-20 14:26:29 +00:00
Qianfeng Zhang	1a8f2f21fb	Move scaling by attn_scale to inside the main-loop	2025-10-20 14:22:18 +00:00
Qianfeng Zhang	bbda3f6f1c	Let IsTokenPairInsideMask() return bool type	2025-10-20 14:21:26 +00:00
Qianfeng Zhang	fdb89d3e2f	Add instances to consider for adding softmax support	2025-10-20 14:20:54 +00:00
Qianfeng Zhang	2072e53d1e	Remove K0 from tile setting since it is not used	2025-10-14 07:15:26 +00:00
Qianfeng Zhang	22a7b31865	Change to pipeline so that it is easier to add support of using softmax	2025-10-12 06:09:55 +00:00
Qianfeng Zhang	d308b09fae	Remove using IGLP method for instruction scheduling for kUseLocal true path	2025-10-12 06:09:25 +00:00
Qianfeng Zhang	6b40ce4074	Fix in GetQKBlockGemm()	2025-09-27 14:59:32 +00:00
Qianfeng Zhang	27b96b15c4	Simplify the warp_gemm definitions in GetQKBlockGemm and GetKVBlockGemm	2025-09-25 15:38:55 +00:00
Qianfeng Zhang	bd32cc0de0	Remove useless constant statement in the kernel	2025-09-19 07:24:29 +00:00
Qianfeng Zhang	db62a9f47e	Remove un-necessary HSTU_CHECK() callings	2025-09-13 16:39:14 +00:00
Qianfeng Zhang	2427426640	Add HSTU_CHECK() and use it in example codes	2025-09-13 16:38:33 +00:00
Qianfeng Zhang	a5b7360862	Small update in reference hstu attention	2025-09-13 06:53:54 +00:00
Qianfeng Zhang	798fc3cc0b	Detach HstuBlockMask from pipeline definition and construct the HstuBlockMask type in the kernel according to window_size	2025-09-12 09:11:47 +00:00
Qianfeng Zhang	7d10353fda	Unify the license statements on all the source files	2025-09-11 10:27:00 +00:00
Qianfeng Zhang	1c030e8c3c	Remove using MakeKargsImpl() to simplify the hstu kernel	2025-09-10 15:28:12 +00:00
Qianfeng Zhang	72eb4e95d8	Clarify the using of kSubQKHeaddim and kQKHeaddim so that less regular hdim (eg. 96, 160) can be efficiently supported	2025-09-09 12:55:01 +00:00
Qianfeng Zhang	f8dea2bc86	Use set_slice_tile() to replace direct thread_buffer assignment	2025-09-09 12:54:32 +00:00
Qianfeng Zhang	4bf65d9fe5	Merge branch 'develop' into hstu_attention_mi350_fwd_bwd and change in using ck_tile::make_kernel	2025-09-01 07:35:35 +00:00
Aviral Goel	fcff0043ae	chore(gemm): clang format to pass CI (#2758)	2025-08-29 00:38:46 -07:00
Vijay Krish	4208e28988	ck_tile kernel for gemm with groupwise quantized B tensor. (#2663) * This change introduces new pipelines with Intrawave scheduler and block gemm primitives that load the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, BQ, is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats: fp8, fp8 -> f32; bf8, bf8 -> f32; fp8, i4 -> f32; bf8, i4 -> f32. Group size can go down to as low as K length of underlying WarpGemm primitive. * Solve merge conflict * [CK TILE] Update CHANGELOG.md --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Cong Ma <congma13@amd.com>	2025-08-28 23:43:02 -07:00
Cong Ma	428090f749	Support transposed C tile in Aquant (#2679) The performance of Aquant has increased after enabling transposed C. Do not need to exchange AQ elements among lanes after enabling transposed C as one thread only holds data from one row.	2025-08-28 13:28:09 -07:00
asleepzzz	038ea82315	Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757) This reverts commit `ead4447b20`.	2025-08-28 22:50:42 +08:00
linqunAMD	4a49dac7c6	[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728) * fix copy basic build error * fix other ck tile test build error	2025-08-28 20:30:30 +08:00
Yi DING	ead4447b20	[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741) * 16x192 * Use buffer_load_lds for lse/d * Dispatch & cleanup * Avoid zeroing dq & fix * fix	2025-08-28 18:54:18 +08:00

874 Commits