composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	010b3f48b3	Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time" This reverts commit `2972de4c88`.	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	bce38c1531	Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	87b5aa78bd	Use kK1=16	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	bce88a9e73	Use type_convert rather than static_cast in f_silu	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	9c3e49a1d0	Add max_seqlen as divider in siLu	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	717aae7ce7	Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	380165c3dc	Override and fix GetAlignmentK()	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	34998cfd19	Use kN0=32	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	94f8d71ee2	Temporarily close the instance for hdim64 and hdim256 to save compiling time	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	9df0fad750	Hack block_gemm_areg_bsmem_creg_v2 to let s_acc for gemm_0 not need be cleared first	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	6bf4877a20	Adjust the v_tile and k_tile loading location	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	ba037426c5	Put two gemms call inside one n0loop unroll	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	23852ef4c0	Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausl=true but kUseLocal=false	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	ddd9227453	Replace set_tile_if() by sweep_tile_span() to reduce branching	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	187f4d3f68	Update the GridSize() and GetTileIndex() in hstu kernel	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	9a08e1090e	Add scripts for measuring jagged with/no causal cases	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	c0128a9156	Tiny update in IsTokenPairInsideMask()	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	ac0e593e0d	Use compiler builtin directly in f_silu for float type	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	31c21c74d8	Code re-arrangement in pipeline	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	eb2564fe46	Update the seqlen_k_curr inside the first gemm loop	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	40683ee932	Rename the performance measurement scripts	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	79fdd564b8	Add support for WarpGem-16x16x32 in QK-BlockGemm (which enables using ds_write/read_b128 for K	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	1986d8c578	Update in K-Lds laying-out to consider for both WarpGemm-32x32x16 and WarpGemm-16x16x16	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	6dd83b2a5a	Use 16x16x16 WarpGemm	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	7153a99dd4	Using __builtin_amdgcn_rcpf in siLU function	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	fb89a013b7	Combine minus with scale_s	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	23e80a5964	Move silu calculation to gemm1 iteration and try to interleave gemm_1 and silu	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	f40d68c1a9	Update in using masking for the case where kMasking is false and kPadSeqLenK is true	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	95b9a277ac	Fix in generate_instances.py and re-generated the instances	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	e5fb03a4aa	Back to use exp() instead of exp2() since exp() in ck_tile using fast __builtin_amdgcn_exp2f()	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	266e7bc8e9	Use kN0=64 to save vgprs	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	8f7a97fe02	Fix the script name	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	bace12feac	Fix in GetTileRangeAlongX	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	7a7c17802a	Add script compare_with_triton_2.sh for measuring the jagged cases of seqlen 1024/2048/4096/8192/16384/32768	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	e731437af1	Change gemm0 to iterate along kN0 so that BlockGemm can overlap with maksing and siLu	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	8da21d9cde	Fix the GetTileRangeAlongX() to align with the hstu masking definition when both causal=true and local=true	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	065776d42d	Remove un-needed __builtin_amdgcn_sched_barrier(0)	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	63acd4638b	Use shared ring Lds buffers for K/V to avoid over-lapping between first-K/last-V or last-K/first-V	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	58090fe730	Tiny codes simplification in pipeline	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	ec14e9df3e	Remove one line of __builtin_amdgcn_sched_barrier(0)	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	c0609d49cd	Fix the integer overflow in total_flops calculation	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	1efb2a8f38	Add scripts for comparing with triton	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	70b4aa310f	Use exp2() to calculate exp() for better performance	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	45019fd5fd	Remove the comparing of row/col to max_uih_len in masking	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	ad10a2dd53	Use kM0=128 kN0=64 to completely remove the vgprs spilling	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	8b2948b31e	Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	fafb375122	Use packed cast_tile for fp16	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	6686c7af44	Update to partially reduce the register spilling	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	459c5565d4	Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce call of s_barrier()	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	8a6c2591b0	Update the in pipeline codes	2026-06-23 09:17:26 +00:00

1122 Commits